pyopencl-2025.1/.gitignore0000644000000000000000000000131014332717401012347 0ustar00_skbuild .pydevproject .project .settings *~ .*.sw[po] .sw[po] *.dat *.pyc build *.prof doc/hedge-notes.pdf *.vtk *.silo *.session dump.py *.orig /Makefile *.png tags *.vtu *.pvtu *.pvd doc/user-reference doc/dev-reference *.poly *.node *.bak *.pdf *.tif *.so *.pyd *.mpeg *-journal visitlog.py *.log .figleaf dist *.egg* MANIFEST *.patch *.LOCAL.[0-9]* *.REMOTE.[0-9]* *.BASE.[0-9]* tmp temp* setuptools.pth distribute-*.tar.gz core *.sess _build __pycache__ *.o .ipynb_checkpoints cscope.* # needed by jenkins env .env virtualenv-[0-9]* pytest.xml setuptools*tar.gz build-and-test-py-project.sh cffi_build.py .cache .pytest_cache .idea wheelhouse memray-*.bin memray-*.html .pylintrc.yml .run-pylint.py pyopencl-2025.1/.gitlab-ci.yml0000644000000000000000000001144714332717401013027 0ustar00variables: GIT_SUBMODULE_STRATEGY: recursive Python 3 Intel CPU: script: | source /opt/enable-intel-cl.sh export PYOPENCL_TEST="intel(r):pu" export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - intel-cl-cpu except: - tags artifacts: reports: junit: test/pytest.xml Python 3 Nvidia Titan X: script: | export PYOPENCL_TEST=nvi:titan export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - nvidia-titan-x except: - tags artifacts: reports: junit: test/pytest.xml Python 3 Nvidia Titan V: script: | export PYOPENCL_TEST=nvi:titan export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - nvidia-titan-v except: - tags artifacts: reports: junit: test/pytest.xml Python 3 Nvidia K40: script: | export PYOPENCL_TEST=nvi:k40 export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - nvidia-k40 except: - tags artifacts: reports: junit: test/pytest.xml Python 3 POCL: script: | export PYOPENCL_TEST=portable:cpu export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - pocl except: - tags artifacts: reports: junit: test/pytest.xml Python 3 POCL CL 1.1: script: | export PYOPENCL_TEST=portable:cpu export EXTRA_INSTALL="numpy mako" export PYOPENCL_PRETEND_CL_VERSION='1.1' curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - pocl except: - tags artifacts: reports: junit: test/pytest.xml Python 3 POCL K40: script: | export PYOPENCL_TEST=portable:k40 export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - pocl - nvidia-k40 except: - tags artifacts: reports: junit: test/pytest.xml Python 3 POCL Titan V: script: | export PYOPENCL_TEST=portable:titan export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - pocl - nvidia-titan-v except: - tags artifacts: reports: junit: test/pytest.xml Python 3 POCL (+GL and special functions): script: | export PYOPENCL_TEST=portable:cpu export EXTRA_INSTALL="numpy mako scipy pyfmmlib" export PYOPENCL_ENABLE_GL=ON curl -L -O https://tiker.net/ci-support-v0 . 
ci-support-v0 build_py_project_in_venv test_py_project tags: - python3 - pocl except: - tags artifacts: reports: junit: test/pytest.xml Ruff: script: | pipx install ruff ruff check tags: - docker-runner except: - tags Pylint: script: | export EXTRA_INSTALL="numpy mako matplotlib PyOpenGl IPython" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv # Avoid linting local directory, where native module # cannot be imported. rm -Rf "$(get_proj_name)" run_pylint "$(get_proj_name)" test/*.py tags: - python3 except: - tags Mypy: script: | export EXTRA_INSTALL="numpy mako mypy importlib-resources" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv python -m mypy --show-error-codes pyopencl test tags: - python3 except: - tags Documentation: script: | export EXTRA_INSTALL="numpy mako" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv build_docs maybe_upload_docs tags: - linux Examples: script: | export EXTRA_INSTALL="pillow cgen mako imageio" curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 build_py_project_in_venv (cd examples; rm -f gl_*) run_examples --no-require-main except: - tags tags: - python3 - pocl Downstream: parallel: matrix: - DOWNSTREAM_PROJECT: [loopy, boxtree, meshmode] tags: - large-node - docker-runner script: | curl -L -O https://tiker.net/ci-support-v0 . ci-support-v0 prepare_downstream_build "https://github.com/inducer/$DOWNSTREAM_PROJECT.git" sed -i 's/pyopencl/ocl-icd/' .test-conda-env-py3.yml build_py_project_in_conda_env test_py_project # vim: sw=2 pyopencl-2025.1/.gitmodules0000644000000000000000000000014214332717401012536 0ustar00[submodule "pyopencl/compyte"] path = pyopencl/compyte url = https://github.com/inducer/compyte pyopencl-2025.1/.pylintrc-local.yml0000644000000000000000000000016514332717401014123 0ustar00- arg: ignore val: compyte - arg: generated-members val: - cltypes.* - gl_platform.* - mako.template pyopencl-2025.1/.test-conda-env-py3.yml0000644000000000000000000000017214332717401014525 0ustar00name: test-conda-env channels: - conda-forge - nodefaults dependencies: - python=3 - git - numpy - ocl-icd - pocl - mako pyopencl-2025.1/CITATION.cff0000644000000000000000000000356614332717401012270 0ustar00cff-version: 1.2.0 message: "If you use this software, please cite it as below." 
authors: - family-names: "Kloeckner" given-names: "Andreas" orcid: "https://orcid.org/0000-0003-1228-519X" - family-names: "Yu" given-names: "Yichao" - family-names: "Wala" given-names: "Matt" - family-names: "Fernando" given-names: "Isuru" - family-names: "Bencun" given-names: "Marko" - family-names: "Kulkarni" given-names: "Kaushik" - family-names: "Diener" given-names: "Matthias" - family-names: "Gao" given-names: "Hao" - family-names: "Fikl" given-names: "Alex" - family-names: "Weiner" given-names: "Zach" - family-names: "Weigert" given-names: "Martin" - family-names: "Palmer" given-names: "Rebecca" - family-names: "Latham" given-names: "Shane" - family-names: "Magno" given-names: "Gonçalo" - family-names: "Fuller" given-names: "Henry" - family-names: "Mackenzie" given-names: "Jonathan" - family-names: "Niarchos" given-names: "Sotiris" - family-names: "Gill" given-names: "Shahzaib" - family-names: "Gohlke" given-names: "Christoph" - family-names: "Bhosale" given-names: "Aditya" - family-names: "Rothberg" given-names: "Alex" - family-names: "Ey" given-names: "Emanuel" - family-names: "Rapp" given-names: "Holger" - family-names: "van der Walt" given-names: "Stefan" # Removed pending resolution of https://github.com/zenodo/zenodo/issues/2343 # - alias: "gw0" - family-names: "Thalhammer" given-names: "Gregor" - family-names: "Kieffer" given-names: "Jerome" - family-names: "Poliarnyi" given-names: "Nikolai" - family-names: "Bollinger" given-names: "Drew" - family-names: "Nitz" given-names: "Alex" - family-names: "Bokota" given-names: "Grzegorz" orcid: 'https://orcid.org/0000-0002-5470-1676' title: "PyOpenCL" version: 2022.1.3 doi: 10.5281/zenodo.6533956 date-released: 2022-03-10 url: "https://github.com/inducer/pyopencl" license: MIT pyopencl-2025.1/CMakeLists.txt0000644000000000000000000001727014332717401013133 0ustar00cmake_minimum_required(VERSION 3.17...3.26) project(pyopencl LANGUAGES CXX VERSION ${SKBUILD_PROJECT_VERSION}) if(NOT SKBUILD) message(WARNING "\ This CMake file is meant to be executed using 'scikit-build'. Running it directly will almost certainly not produce the desired result. If you are a user trying to install this package, please use the command below, which will install all necessary build dependencies, compile the package in an isolated environment, and then install it. ===================================================================== $ pip install . ===================================================================== If you are a software developer, and this is your own package, then it is usually much more efficient to install the build dependencies in your environment once and use the following command that avoids a costly creation of a new virtual environment at every compilation: ===================================================================== $ pip install nanobind scikit-build-core[pyproject] $ pip install --no-build-isolation -ve . ===================================================================== You may optionally add -Ceditable.rebuild=true to auto-rebuild when the package is imported. 
Otherwise, you need to re-run the above after editing C++ files.") endif() # {{{ Options option(PYOPENCL_TRACE "Enable OpenCL tracing" $ENV{PYOPENCL_TRACE}) option(PYOPENCL_ENABLE_GL "Enable OpenGL interoperability" $ENV{PYOPENCL_ENABLE_GL}) option(PYOPENCL_USE_SHIPPED_EXT "Use shipped CL extension header" $ENV{PYOPENCL_USE_SHIPPED_EXT}) set(CL_INC_DIR CACHE STRING "OpenCL include directory") set(CL_LIB_DIR CACHE STRING "OpenCL library directory") set(CL_LIBNAME CACHE STRING "OpenCL library name") set(PYOPENCL_PRETEND_CL_VERSION CACHE STRING "Pretend to be a different OpenCL version") if(NOT CL_INC_DIR) message(STATUS "CL_INC_DIR not set, trying to guess it from environment variables.") if(DEFINED ENV{CL_INC_DIR}) message(STATUS "Using OpenCL include directory from environment '$ENV{CL_INC_DIR}'") set(CL_INC_DIR $ENV{CL_INC_DIR}) endif() if(DEFINED ENV{CL_LIB_DIR}) message(STATUS "Using OpenCL library directory from environment '$ENV{CL_INC_DIR}'") set(CL_LIB_DIR $ENV{CL_LIB_DIR}) endif() if(DEFINED ENV{CL_LIBNAME}) message(STATUS "Using OpenCL library name from environment '$ENV{CL_LIBNAME}'") set(CL_LIBNAME $ENV{CL_LIBNAME}) endif() endif(NOT CL_INC_DIR) if(NOT CL_INC_DIR) message(STATUS "CL_INC_DIR not set, trying to guess it from conda environment.") if(DEFINED ENV{CONDA_PREFIX}) # Linux/MacOS: if(EXISTS $ENV{CONDA_PREFIX}/lib/libOpenCL${CMAKE_SHARED_LIBRARY_SUFFIX}) message(STATUS "Found OpenCL in conda environment '$ENV{CONDA_PREFIX}'") set(CL_INC_DIR $ENV{CONDA_PREFIX}/include) set(CL_LIB_DIR $ENV{CONDA_PREFIX}/lib) set(CL_LIBNAME OpenCL) # Windows: elseif(EXISTS $ENV{CONDA_PREFIX}/Library/lib/OpenCL${CMAKE_STATIC_LIBRARY_SUFFIX}) message(STATUS "Found OpenCL in conda environment '$ENV{CONDA_PREFIX}'") set(CL_INC_DIR $ENV{CONDA_PREFIX}/Library/include) set(CL_LIB_DIR $ENV{CONDA_PREFIX}/Library/lib) set(CL_LIBNAME OpenCL) endif() endif(DEFINED ENV{CONDA_PREFIX}) endif(NOT CL_INC_DIR) if(NOT PYOPENCL_PRETEND_CL_VERSION) if(DEFINED ENV{PYOPENCL_PRETEND_CL_VERSION}) set(PYOPENCL_PRETEND_CL_VERSION $ENV{PYOPENCL_PRETEND_CL_VERSION}) endif() endif() if(PYOPENCL_PRETEND_CL_VERSION) # Split the version string into a list string(REPLACE "." 
";" VERSION_LIST ${PYOPENCL_PRETEND_CL_VERSION}) # Get the major and minor version numbers list(GET VERSION_LIST 0 MAJOR) list(GET VERSION_LIST 1 MINOR) # Calculate the numerical value math(EXPR ARG "0x1000*${MAJOR} + 0x10*${MINOR}") message(STATUS "Pretending to use OpenCL version ${PYOPENCL_PRETEND_CL_VERSION} (${ARG})") set(PYOPENCL_PRETEND_CL_VERSION ${ARG}) endif() message(STATUS "CL_INC_DIR ${CL_INC_DIR}") message(STATUS "CL_LIB_DIR ${CL_LIB_DIR}") message(STATUS "CL_LIBNAME ${CL_LIBNAME}") # }}} # {{{ Get version information find_program(GIT git) if(GIT AND EXISTS ${CMAKE_SOURCE_DIR}/.git) # Exact tag match => released version execute_process(COMMAND git describe --exact-match --dirty=* OUTPUT_VARIABLE PYOPENCL_VERSION_GIT RESULT_VARIABLE git_result OUTPUT_STRIP_TRAILING_WHITESPACE ERROR_QUIET ) if(NOT ${git_result} EQUAL 0) # No exact tag match => development version execute_process(COMMAND git describe --long --always --dirty=* OUTPUT_VARIABLE PYOPENCL_VERSION_GIT OUTPUT_STRIP_TRAILING_WHITESPACE ) set(PYOPENCL_REL "(dev)") else() set(PYOPENCL_REL "(release)") endif() else() set(PYOPENCL_VERSION_GIT "v${PROJECT_VERSION}") set(PYOPENCL_REL "(non-git)") endif() # }}} find_package(Python COMPONENTS Interpreter Development.Module NumPy REQUIRED) if (NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES) set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE) set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo") endif() # {{{ Detect nanobind and import it execute_process( COMMAND "${PYTHON_EXECUTABLE}" -c "import nanobind; print(nanobind.__version__)" OUTPUT_VARIABLE NANOBIND_VERSION OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ECHO STDOUT) execute_process( COMMAND "${PYTHON_EXECUTABLE}" -c "import nanobind; print(nanobind.cmake_dir())" OUTPUT_VARIABLE NANOBIND_DIR OUTPUT_STRIP_TRAILING_WHITESPACE COMMAND_ECHO STDOUT) list(APPEND CMAKE_PREFIX_PATH "${NANOBIND_DIR}") # }}} link_directories(${CL_LIB_DIR}) include_directories(${CL_INC_DIR} ${Python_NumPy_INCLUDE_DIRS}) find_package(nanobind CONFIG REQUIRED) set(OpenCL_ROOT ${CL_LIB_DIR}) set(OpenCL_INCLUDE_DIR ${CL_INC_DIR}) set(OpenCL_LIBRARY ${CL_LIBNAME}) find_package(OpenCL REQUIRED) nanobind_add_module( _cl NB_STATIC # Build static libnanobind (the extension module itself remains a shared library) LTO NOMINSIZE src/wrap_constants.cpp src/wrap_cl.cpp src/wrap_cl_part_1.cpp src/wrap_cl_part_2.cpp src/wrap_mempool.cpp src/bitlog.cpp ) target_link_libraries(_cl PRIVATE ${OpenCL_LIBRARY}) target_compile_definitions(_cl PRIVATE PYGPU_PACKAGE=pyopencl PYGPU_PYOPENCL ) if (PYOPENCL_PRETEND_CL_VERSION) target_compile_definitions( _cl PRIVATE PYOPENCL_PRETEND_CL_VERSION=${PYOPENCL_PRETEND_CL_VERSION}) endif() if (PYOPENCL_ENABLE_GL) target_compile_definitions(_cl PRIVATE HAVE_GL=1) endif() if (PYOPENCL_TRACE) target_compile_definitions(_cl PRIVATE PYOPENCL_TRACE=1) endif() if (PYOPENCL_USE_SHIPPED_EXT) target_compile_definitions(_cl PRIVATE PYOPENCL_USE_SHIPPED_EXT=1) endif() install(TARGETS _cl LIBRARY DESTINATION pyopencl) # {{{ Print configuration message("==============================") message("PyOpenCL ${PYOPENCL_VERSION_GIT} ${PYOPENCL_REL} configuration: ") message(" PyOpenCL options: PYOPENCL_TRACE=${PYOPENCL_TRACE} PYOPENCL_ENABLE_GL=${PYOPENCL_ENABLE_GL} PYOPENCL_USE_SHIPPED_EXT=${PYOPENCL_USE_SHIPPED_EXT} PYOPENCL_PRETEND_CL_VERSION=${PYOPENCL_PRETEND_CL_VERSION}") message(" OpenCL: ${OpenCL_LIBRARIES} [${OpenCL_VERSION_STRING}]") message(" Python: 
${Python_EXECUTABLE} [${Python_VERSION}]") message(" Build type: ${CMAKE_BUILD_TYPE}") message(" C++ compiler: ${CMAKE_CXX_COMPILER} [${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION}]") message(" CMake: ${CMAKE_COMMAND} [${CMAKE_VERSION}]") message(" Nanobind: ${NANOBIND_DIR} [${NANOBIND_VERSION}]") message(" Build tool: ${CMAKE_MAKE_PROGRAM}") message("==============================") # }}} # vim: foldmethod=marker:sw=2 pyopencl-2025.1/LICENSE0000644000000000000000000000740414332717401011376 0ustar00PyOpenCL is licensed to you under the MIT/X Consortium license: Copyright (c) 2009-13 Andreas Klöckner and Contributors. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. PyOpenCL includes derivatives of parts of the `Thrust `_ computing package (in particular the scan implementation). These parts are licensed as follows: Copyright 2008-2011 NVIDIA Corporation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. .. note:: If you use Apache-licensed parts, be aware that these may be incompatible with software licensed exclusively under GPL2. (Most software is licensed as GPL2 or later, in which case this is not an issue.) PyOpenCL includes parts of the Random123 suite of random number generators: Copyright 2010-2012, D. E. Shaw Research. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of D. E. Shaw Research nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. pyopencl-2025.1/README.rst0000644000000000000000000000624114332717401012056 0ustar00PyOpenCL: Pythonic Access to OpenCL, with Arrays and Algorithms =============================================================== .. |badge-gitlab-ci| image:: https://gitlab.tiker.net/inducer/pyopencl/badges/main/pipeline.svg :alt: Gitlab Build Status :target: https://gitlab.tiker.net/inducer/pyopencl/commits/main .. |badge-github-ci| image:: https://github.com/inducer/pyopencl/workflows/CI/badge.svg?branch=main&event=push :alt: Github Build Status :target: https://github.com/inducer/pyopencl/actions?query=branch%3Amain+workflow%3ACI+event%3Apush .. |badge-pypi| image:: https://badge.fury.io/py/pyopencl.svg :alt: Python Package Index Release Page :target: https://pypi.org/project/pyopencl/ .. |badge-zenodo| image:: https://zenodo.org/badge/1575307.svg :alt: Zenodo DOI for latest release :target: https://zenodo.org/badge/latestdoi/1575307 |badge-gitlab-ci| |badge-github-ci| |badge-pypi| |badge-zenodo| PyOpenCL lets you access GPUs and other massively parallel compute devices from Python. It tries to offer computing goodness in the spirit of its sister project `PyCUDA `__: * Object cleanup tied to lifetime of objects. This idiom, often called `RAII `__ in C++, makes it much easier to write correct, leak- and crash-free code. * Completeness. PyOpenCL puts the full power of OpenCL's API at your disposal, if you wish. Every obscure ``get_info()`` query and all CL calls are accessible. * Automatic Error Checking. All CL errors are automatically translated into Python exceptions. * Speed. PyOpenCL's base layer is written in C++, so all the niceties above are virtually free. * Helpful and complete `Documentation `__ as well as a `Wiki `__. * Liberal license. PyOpenCL is open-source under the `MIT license `__ and free for commercial, academic, and private use. * Broad support. PyOpenCL was tested and works with Apple's, AMD's, and Nvidia's CL implementations. Simple 4-step `install instructions `__ using Conda on Linux and macOS (that also install a working OpenCL implementation!) can be found in the `documentation `__. What you'll need if you do *not* want to use the convenient instructions above and instead build from source: * g++/clang new enough to be compatible with nanobind (specifically, full support of C++17 is needed) * `numpy `__, and * an OpenCL implementation. (See this `howto `__ for how to get one.) 
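To give a taste of the API, here is a minimal sketch of a PyOpenCL program (assuming a working OpenCL implementation is installed; ``create_some_context`` may prompt for a device choice)::

    import numpy as np
    import pyopencl as cl
    import pyopencl.array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Move two vectors to the device and add them there.
    a = cl.array.to_device(queue, np.random.rand(50000).astype(np.float32))
    b = cl.array.to_device(queue, np.random.rand(50000).astype(np.float32))

    assert np.allclose((a + b).get(), a.get() + b.get())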
Links ----- * `Documentation `__ (read how things work) * `Python package index `__ (download releases, including binary wheels for Linux, macOS, Windows) * `Conda Forge `__ (download binary packages for Linux, macOS, Windows) * `Github `__ (get latest source code, file bugs) pyopencl-2025.1/contrib/cldis.py0000644000000000000000000000706514332717401013504 0ustar00__copyright__ = "Copyright (C) 2022 Isuru Fernando" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ """ cldis.py A script to compile and print the native code for a OpenCL kernel. Usage: python cldis.py prog.cl """ import glob import os import re import subprocess import sys import tempfile def main(ctx, tmp_dir, cl_str, output=None, build_options=()): device = ctx.devices[0] platform = device.platform if platform.name == "NVIDIA CUDA": supported_outputs = ["ptx", "sass"] elif platform.name == "Portable Computing Language": if device.name.startswith("NVIDIA"): supported_outputs = ["ptx", "sass"] elif device.name.startswith("pthread") or device.name.startswith("cpu"): supported_outputs = ["asm"] else: raise NotImplementedError(f"Unknown pocl device '{device.name}'") else: raise NotImplementedError(f"Unknown opencl device '{device}'") if output is None: output = supported_outputs[0] else: assert output in supported_outputs prg = cl.Program(ctx, cl_str).build(options=build_options, cache_dir=os.path.join(tmp_dir, "cache")) for binary in prg.binaries: if output in ["ptx", "sass"]: res = binary[binary.index(b"// Generated"):].decode("utf-8") if output == "sass": with open(os.path.join(tmp_dir, "cl.ptx"), "w") as f: f.write(res) tgt = re.findall(r".target sm_[0-9]*", res, re.MULTILINE)[0] gpu_name = tgt[8:] subprocess.check_call(["ptxas", "cl.ptx", "--verbose", f"--gpu-name={gpu_name}", "--warn-on-spills"], cwd=tmp_dir) res = subprocess.check_output(["cuobjdump", "-sass", "elf.o"], cwd=tmp_dir).decode("utf-8") elif output == "asm" and platform.name == "Portable Computing Language": so = glob.glob(f"{tmp_dir}/**/*.so", recursive=True)[0] res = subprocess.check_output(["objdump", "-d", so]).decode("utf-8") print(res) if __name__ == "__main__": with tempfile.TemporaryDirectory() as tmp_dir: os.environ["POCL_CACHE_DIR"] = os.path.join(tmp_dir, "pocl_cache") import pyopencl as cl ctx = cl.create_some_context() cl_file = sys.argv[1] with open(cl_file) as f: cl_str = f.read() output = sys.argv[2] if len(sys.argv) >= 3 else None build_options = sys.argv[3:] if len(sys.argv) >= 4 else [] main(ctx, tmp_dir, cl_str, output, build_options) 
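# Example invocations (file and option names hypothetical):
#   python cldis.py prog.cl           # native code, default output format
#   python cldis.py prog.cl sass -g   # SASS output, "-g" passed as a build option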
pyopencl-2025.1/contrib/fortran-to-opencl/README0000644000000000000000000000157114332717401016261 0ustar00Experimental Fortran-to-OpenCL translator ----------------------------------------- This is a highly experimental Fortran-to-OpenCL translator. Its purpose is to translate computational kernels into OpenCL-like C. It doesn't auto-parallelize. My purpose in writing this was to convert a few special-function evaluators. The best it can hope for at the moment is to automate most of the process so that you'll only have to fix up a few things manually afterwards. It further only deals with the subset of Fortran 77 that I needed. Quite a number of things are unimplemented. Patches are welcome. Andreas Kloeckner Dependencies: - cnd http://github.com/inducer/cnd - cgen http://github.com/inducer/cgen - pymbolic http://github.com/inducer/pymbolic - fparser http://code.google.com/p/f2py with fix from http://code.google.com/p/f2py/issues/detail?id=32 pyopencl-2025.1/contrib/fortran-to-opencl/translate.py0000644000000000000000000012544614332717401017760 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import re from sys import intern from typing import ClassVar, Dict, List, Tuple from warnings import warn import numpy as np import cgen import pymbolic.primitives as p import pytools.lex from pymbolic.mapper import CombineMapper from pymbolic.mapper.c_code import CCodeMapper as CCodeMapperBase from pymbolic.parser import Parser as ExpressionParserBase class TranslatorWarning(UserWarning): pass class TranslationError(RuntimeError): pass def complex_type_name(dtype): if dtype == np.complex64: return "cfloat" if dtype == np.complex128: return "cdouble" else: raise RuntimeError # {{{ AST components def dtype_to_ctype(dtype): if dtype is None: raise ValueError("dtype may not be None") dtype = np.dtype(dtype) if dtype == np.int64: return "long" elif dtype == np.uint64: return "unsigned long" elif dtype == np.int32: return "int" elif dtype == np.uint32: return "unsigned int" elif dtype == np.int16: return "short int" elif dtype == np.uint16: return "short unsigned int" elif dtype == np.int8: return "signed char" elif dtype == np.uint8: return "unsigned char" elif dtype == np.float32: return "float" elif dtype == np.float64: return "double" elif dtype == np.complex64: return "cfloat_t" elif dtype == np.complex128: return "cdouble_t" else: raise ValueError("unable to map dtype '%s'" % dtype) class POD(cgen.POD): def get_decl_pair(self): return [dtype_to_ctype(self.dtype)], self.name # }}} # {{{ expression parser _less_than = intern("less_than") _greater_than = intern("greater_than") _less_equal = intern("less_equal") _greater_equal = intern("greater_equal") _equal = intern("equal") _not_equal = intern("not_equal") _not = intern("not") _and = intern("and") _or = intern("or") class TypedLiteral(p.Leaf): def __init__(self, value, dtype): self.value = value self.dtype = np.dtype(dtype) def __getinitargs__(self): return self.value, self.dtype mapper_method = intern("map_literal") def simplify_typed_literal(expr): if (isinstance(expr, p.Product) and len(expr.children) == 2 and isinstance(expr.children[1], TypedLiteral) and p.is_constant(expr.children[0]) and expr.children[0] == -1): tl = expr.children[1] return TypedLiteral("-"+tl.value, tl.dtype) else: return expr class FortranExpressionParser(ExpressionParserBase): # FIXME double/single prec literals lex_table: ClassVar[List[Tuple[str, str]]] = [ (_less_than, pytools.lex.RE(r"\.lt\.", re.I)), (_greater_than, pytools.lex.RE(r"\.gt\.", re.I)), (_less_equal, pytools.lex.RE(r"\.le\.", re.I)), (_greater_equal, pytools.lex.RE(r"\.ge\.", re.I)), (_equal, pytools.lex.RE(r"\.eq\.", re.I)), (_not_equal, pytools.lex.RE(r"\.ne\.", re.I)), (_not, pytools.lex.RE(r"\.not\.", re.I)), (_and, pytools.lex.RE(r"\.and\.", re.I)), (_or, pytools.lex.RE(r"\.or\.", re.I)), *ExpressionParserBase.lex_table] def __init__(self, tree_walker): self.tree_walker = tree_walker _PREC_FUNC_ARGS = 1 def parse_terminal(self, pstate): scope = self.tree_walker.scope_stack[-1] from pymbolic.parser import _closepar, _float, _identifier, _openpar next_tag = pstate.next_tag() if next_tag is _float: value = pstate.next_str_and_advance().lower() if "d" in value: dtype = np.float64 else: dtype = np.float32 value = value.replace("d", "e") if value.startswith("."): prev_value = value value = "0"+value print(value, prev_value) elif value.startswith("-."): prev_value = value value = "-0"+value[1:] print(value, prev_value) return TypedLiteral(value, dtype) elif next_tag is _identifier: name = pstate.next_str_and_advance() if pstate.is_at_end() or pstate.next_tag() is not _openpar: # not a subscript 
scope.use_name(name) return p.Variable(name) left_exp = p.Variable(name) pstate.advance() pstate.expect_not_end() if scope.is_known(name): cls = p.Subscript else: cls = p.Call if pstate.next_tag is _closepar: pstate.advance() left_exp = cls(left_exp, ()) else: args = self.parse_expression(pstate, self._PREC_FUNC_ARGS) if not isinstance(args, tuple): args = (args,) left_exp = cls(left_exp, args) pstate.expect(_closepar) pstate.advance() return left_exp else: return ExpressionParserBase.parse_terminal( self, pstate) COMP_MAP: ClassVar[Dict[str, str]] = { _less_than: "<", _less_equal: "<=", _greater_than: ">", _greater_equal: ">=", _equal: "==", _not_equal: "!=", } def parse_prefix(self, pstate, min_precedence=0): import pymbolic.primitives as primitives from pymbolic.parser import _PREC_UNARY pstate.expect_not_end() if pstate.is_next(_not): pstate.advance() return primitives.LogicalNot( self.parse_expression(pstate, _PREC_UNARY)) else: return ExpressionParserBase.parse_prefix(self, pstate) def parse_postfix(self, pstate, min_precedence, left_exp): from pymbolic.parser import ( _PREC_CALL, _PREC_COMPARISON, _PREC_LOGICAL_AND, _PREC_LOGICAL_OR, _openpar, ) from pymbolic.primitives import Comparison, LogicalAnd, LogicalOr next_tag = pstate.next_tag() if next_tag is _openpar and _PREC_CALL > min_precedence: raise TranslationError("parenthesis operator only works on names") elif next_tag in self.COMP_MAP and _PREC_COMPARISON > min_precedence: pstate.advance() left_exp = Comparison( left_exp, self.COMP_MAP[next_tag], self.parse_expression(pstate, _PREC_COMPARISON)) did_something = True elif next_tag is _and and _PREC_LOGICAL_AND > min_precedence: pstate.advance() left_exp = LogicalAnd((left_exp, self.parse_expression(pstate, _PREC_LOGICAL_AND))) did_something = True elif next_tag is _or and _PREC_LOGICAL_OR > min_precedence: pstate.advance() left_exp = LogicalOr((left_exp, self.parse_expression(pstate, _PREC_LOGICAL_OR))) did_something = True else: left_exp, did_something = ExpressionParserBase.parse_postfix( self, pstate, min_precedence, left_exp) if isinstance(left_exp, tuple) and min_precedence < self._PREC_FUNC_ARGS: # this must be a complex literal assert len(left_exp) == 2 r, i = left_exp r = simplify_typed_literal(r) i = simplify_typed_literal(i) dtype = (r.dtype.type(0) + i.dtype.type(0)).dtype if dtype == np.float32: dtype = np.complex64 else: dtype = np.complex128 left_exp = TypedLiteral(left_exp, dtype) return left_exp, did_something # }}} # {{{ expression generator class TypeInferenceMapper(CombineMapper): def __init__(self, scope): self.scope = scope def combine(self, dtypes): return sum(dtype.type(1) for dtype in dtypes).dtype def map_literal(self, expr): return expr.dtype def map_constant(self, expr): return np.asarray(expr).dtype def map_variable(self, expr): return self.scope.get_type(expr.name) def map_call(self, expr): name = expr.function.name if name == "fromreal": arg, = expr.parameters base_dtype = self.rec(arg) tgt_real_dtype = (np.float32(0)+base_dtype.type(0)).dtype assert tgt_real_dtype.kind == "f" if tgt_real_dtype == np.float32: return np.dtype(np.complex64) elif tgt_real_dtype == np.float64: return np.dtype(np.complex128) else: raise RuntimeError("unexpected complex type") elif name in ["imag", "real", "abs", "dble"]: arg, = expr.parameters base_dtype = self.rec(arg) if base_dtype == np.complex128: return np.dtype(np.float64) elif base_dtype == np.complex64: return np.dtype(np.float32) else: return base_dtype else: return CombineMapper.map_call(self, expr) class 
ComplexCCodeMapper(CCodeMapperBase): def __init__(self, infer_type): CCodeMapperBase.__init__(self) self.infer_type = infer_type def map_sum(self, expr, enclosing_prec): tgt_dtype = self.infer_type(expr) is_complex = tgt_dtype.kind == "c" if not is_complex: return CCodeMapperBase.map_sum(self, expr, enclosing_prec) else: tgt_name = complex_type_name(tgt_dtype) reals = [child for child in expr.children if "c" != self.infer_type(child).kind] complexes = [child for child in expr.children if "c" == self.infer_type(child).kind] from pymbolic.mapper.stringifier import PREC_NONE, PREC_SUM real_sum = self.join_rec(" + ", reals, PREC_SUM) if len(complexes) == 1: myprec = PREC_SUM else: myprec = PREC_NONE complex_sum = self.rec(complexes[0], myprec) for child in complexes[1:]: complex_sum = "{}_add({}, {})".format( tgt_name, complex_sum, self.rec(child, PREC_NONE)) if real_sum: result = "{}_add({}_fromreal({}), {})".format( tgt_name, tgt_name, real_sum, complex_sum) else: result = complex_sum return self.parenthesize_if_needed(result, enclosing_prec, PREC_SUM) def map_product(self, expr, enclosing_prec): tgt_dtype = self.infer_type(expr) is_complex = "c" == tgt_dtype.kind if not is_complex: return CCodeMapperBase.map_product(self, expr, enclosing_prec) else: tgt_name = complex_type_name(tgt_dtype) reals = [child for child in expr.children if "c" != self.infer_type(child).kind] complexes = [child for child in expr.children if "c" == self.infer_type(child).kind] from pymbolic.mapper.stringifier import PREC_NONE, PREC_PRODUCT real_prd = self.join_rec("*", reals, PREC_PRODUCT) if len(complexes) == 1: myprec = PREC_PRODUCT else: myprec = PREC_NONE complex_prd = self.rec(complexes[0], myprec) for child in complexes[1:]: complex_prd = "{}_mul({}, {})".format( tgt_name, complex_prd, self.rec(child, PREC_NONE)) if real_prd: result = f"{tgt_name}_rmul({real_prd}, {complex_prd})" else: result = complex_prd return self.parenthesize_if_needed(result, enclosing_prec, PREC_PRODUCT) def map_quotient(self, expr, enclosing_prec): from pymbolic.mapper.stringifier import PREC_NONE n_complex = "c" == self.infer_type(expr.numerator).kind d_complex = "c" == self.infer_type(expr.denominator).kind tgt_dtype = self.infer_type(expr) if not (n_complex or d_complex): return CCodeMapperBase.map_quotient(self, expr, enclosing_prec) elif n_complex and not d_complex: return "{}_divider({}, {})".format( complex_type_name(tgt_dtype), self.rec(expr.numerator, PREC_NONE), self.rec(expr.denominator, PREC_NONE)) elif not n_complex and d_complex: return "{}_rdivide({}, {})".format( complex_type_name(tgt_dtype), self.rec(expr.numerator, PREC_NONE), self.rec(expr.denominator, PREC_NONE)) else: return "{}_divide({}, {})".format( complex_type_name(tgt_dtype), self.rec(expr.numerator, PREC_NONE), self.rec(expr.denominator, PREC_NONE)) def map_remainder(self, expr, enclosing_prec): tgt_dtype = self.infer_type(expr) if "c" == tgt_dtype.kind: raise RuntimeError("complex remainder not defined") return CCodeMapperBase.map_remainder(self, expr, enclosing_prec) def map_power(self, expr, enclosing_prec): from pymbolic.mapper.stringifier import PREC_NONE tgt_dtype = self.infer_type(expr) if "c" == tgt_dtype.kind: if expr.exponent in [2, 3, 4]: value = expr.base for i in range(expr.exponent-1): value = value * expr.base return self.rec(value, enclosing_prec) else: b_complex = "c" == self.infer_type(expr.base).kind e_complex = "c" == self.infer_type(expr.exponent).kind if b_complex and not e_complex: return "{}_powr({}, {})".format( 
complex_type_name(tgt_dtype), self.rec(expr.base, PREC_NONE), self.rec(expr.exponent, PREC_NONE)) else: return "{}_pow({}, {})".format( complex_type_name(tgt_dtype), self.rec(expr.base, PREC_NONE), self.rec(expr.exponent, PREC_NONE)) return CCodeMapperBase.map_power(self, expr, enclosing_prec) class CCodeMapper(ComplexCCodeMapper): # Whatever is needed to mop up after Fortran goes here. # Stuff that deals with generating real-valued code # from complex code goes above. def __init__(self, translator, scope): ComplexCCodeMapper.__init__(self, scope.get_type_inference_mapper()) self.translator = translator self.scope = scope def map_subscript(self, expr, enclosing_prec): idx_dtype = self.infer_type(expr.index) if not "i" == idx_dtype.kind or "u" == idx_dtype.kind: ind_prefix = "(int) " else: ind_prefix = "" idx = expr.index if isinstance(idx, tuple) and len(idx) == 1: idx, = idx from pymbolic.mapper.stringifier import PREC_CALL, PREC_NONE return self.parenthesize_if_needed( self.format("%s[%s%s]", self.scope.translate_var_name(expr.aggregate.name), ind_prefix, self.rec(idx, PREC_NONE)), enclosing_prec, PREC_CALL) def map_call(self, expr, enclosing_prec): from pymbolic.mapper.stringifier import PREC_NONE tgt_dtype = self.infer_type(expr) arg_dtypes = [self.infer_type(par) for par in expr.parameters] name = expr.function.name if "f" == tgt_dtype.kind and name == "abs": name = "fabs" elif "c" == tgt_dtype.kind: if name in ["conjg", "dconjg"]: name = "conj" if name[:2] == "cd" and name[2:] in ["log", "exp", "sqrt"]: name = name[2:] if name == "dble": name = "real" name = "{}_{}".format( complex_type_name(tgt_dtype), name) elif name in ["aimag", "real", "imag"] and tgt_dtype.kind == "f": arg_dtype, = arg_dtypes if name == "aimag": name = "imag" name = "{}_{}".format( complex_type_name(arg_dtype), name) elif "c" == tgt_dtype.kind and name == "abs": arg_dtype, = arg_dtypes name = "%s_abs" % ( complex_type_name(arg_dtype)) return self.format("%s(%s)", name, self.join_rec(", ", expr.parameters, PREC_NONE)) def map_variable(self, expr, enclosing_prec): # guaranteed to not be a subscript or a call name = expr.name shape = self.scope.get_shape(name) name = self.scope.translate_var_name(name) if expr.name in self.scope.arg_names: arg_idx = self.scope.arg_names.index(name) if self.translator.arg_needs_pointer( self.scope.subprogram_name, arg_idx): return "*"+name else: return name elif shape not in [(), None]: return "*"+name else: return name def map_literal(self, expr, enclosing_prec): from pymbolic.mapper.stringifier import PREC_NONE if expr.dtype.kind == "c": r, i = expr.value return "{}_new({}, {})".format( complex_type_name(expr.dtype), self.rec(r, PREC_NONE), self.rec(i, PREC_NONE)) else: return expr.value def map_wildcard(self, expr, enclosing_prec): return ":" # }}} class Scope: def __init__(self, subprogram_name, arg_names=set()): self.subprogram_name = subprogram_name # map name to data self.data_statements = {} # map first letter to type self.implicit_types = {} # map name to dim tuple self.dim_map = {} # map name to dim tuple self.type_map = {} # map name to data self.data = {} self.arg_names = arg_names self.used_names = set() self.type_inf_mapper = None def known_names(self): return (self.used_names | set(self.dim_map.keys()) | set(self.type_map.keys())) def is_known(self, name): return (name in self.used_names or name in self.dim_map or name in self.type_map) def use_name(self, name): self.used_names.add(name) def get_type(self, name): try: return self.type_map[name] except KeyError: if 
self.implicit_types is None: raise TranslationError( "no type for '%s' found in implicit none routine" % name) return self.implicit_types.get(name[0], np.dtype(np.int32)) def get_shape(self, name): return self.dim_map.get(name, ()) def get_type_inference_mapper(self): if self.type_inf_mapper is None: self.type_inf_mapper = TypeInferenceMapper(self) return self.type_inf_mapper def translate_var_name(self, name): shape = self.dim_map.get(name) if name in self.data and shape is not None: return f"{self.subprogram_name}_{name}" else: return name class FTreeWalkerBase: def __init__(self): self.scope_stack = [] self.expr_parser = FortranExpressionParser(self) def rec(self, expr, *args, **kwargs): mro = list(type(expr).__mro__) dispatch_class = kwargs.pop("dispatch_class", type(self)) while mro: method_name = "map_"+mro.pop(0).__name__ try: method = getattr(dispatch_class, method_name) except AttributeError: pass else: return method(self, expr, *args, **kwargs) raise NotImplementedError( "%s does not know how to map type '%s'" % (type(self).__name__, type(expr))) ENTITY_RE = re.compile( r"^(?P[_0-9a-zA-Z]+)" r"(\((?P[-+*0-9:a-zA-Z,]+)\))?$") def parse_dimension_specs(self, dim_decls): def parse_bounds(bounds_str): start_end = bounds_str.split(":") assert 1 <= len(start_end) <= 2 return tuple(self.parse_expr(s) for s in start_end) for decl in dim_decls: entity_match = self.ENTITY_RE.match(decl) assert entity_match groups = entity_match.groupdict() name = groups["name"] assert name if groups["shape"]: shape = [parse_bounds(s) for s in groups["shape"].split(",")] else: shape = None yield name, shape def __call__(self, expr, *args, **kwargs): return self.rec(expr, *args, **kwargs) # {{{ expressions def parse_expr(self, expr_str): return self.expr_parser(expr_str) # }}} class ArgumentAnalayzer(FTreeWalkerBase): def __init__(self): FTreeWalkerBase.__init__(self) # map (func, arg_nr) to # "w" for "needs pointer" # [] for no obstacle to de-pointerification known # [(func_name, arg_nr), ...] 
# depends on how this arg is used self.arg_usage_info = {} def arg_needs_pointer(self, func, arg_nr): data = self.arg_usage_info.get((func, arg_nr), []) if isinstance(data, list): return any( self.arg_needs_pointer(sub_func, sub_arg_nr) for sub_func, sub_arg_nr in data) return True # {{{ map_XXX functions def map_BeginSource(self, node): scope = Scope(None) self.scope_stack.append(scope) for c in node.content: self.rec(c) def map_Subroutine(self, node): scope = Scope(node.name, list(node.args)) self.scope_stack.append(scope) for c in node.content: self.rec(c) self.scope_stack.pop() def map_EndSubroutine(self, node): pass def map_Implicit(self, node): pass # {{{ types, declarations def map_Equivalence(self, node): raise NotImplementedError("equivalence") def map_Dimension(self, node): scope = self.scope_stack[-1] for name, shape in self.parse_dimension_specs(node.items): if name in scope.arg_names: arg_idx = scope.arg_names.index(name) self.arg_usage_info[scope.subprogram_name, arg_idx] = "w" def map_External(self, node): pass def map_type_decl(self, node): scope = self.scope_stack[-1] for name, shape in self.parse_dimension_specs(node.entity_decls): if shape is not None and name in scope.arg_names: arg_idx = scope.arg_names.index(name) self.arg_usage_info[scope.subprogram_name, arg_idx] = "w" map_Logical = map_type_decl map_Integer = map_type_decl map_Real = map_type_decl map_Complex = map_type_decl # }}} def map_Data(self, node): pass def map_Parameter(self, node): raise NotImplementedError("parameter") # {{{ I/O def map_Open(self, node): pass def map_Format(self, node): pass def map_Write(self, node): pass def map_Print(self, node): pass def map_Read1(self, node): pass # }}} def map_Assignment(self, node): scope = self.scope_stack[-1] lhs = self.parse_expr(node.variable) if isinstance(lhs, p.Subscript): lhs_name = lhs.aggregate.name elif isinstance(lhs, p.Call): # in absence of dim info, subscripts get parsed as calls lhs_name = lhs.function.name else: lhs_name = lhs.name if lhs_name in scope.arg_names: arg_idx = scope.arg_names.index(lhs_name) self.arg_usage_info[scope.subprogram_name, arg_idx] = "w" def map_Allocate(self, node): raise NotImplementedError("allocate") def map_Deallocate(self, node): raise NotImplementedError("deallocate") def map_Save(self, node): raise NotImplementedError("save") def map_Line(self, node): raise NotImplementedError def map_Program(self, node): raise NotImplementedError def map_Entry(self, node): raise NotImplementedError # {{{ control flow def map_Goto(self, node): pass def map_Call(self, node): scope = self.scope_stack[-1] for i, arg_str in enumerate(node.items): arg = self.parse_expr(arg_str) if isinstance(arg, (p.Variable, p.Subscript)): if isinstance(arg, p.Subscript): arg_name = arg.aggregate.name else: arg_name = arg.name if arg_name in scope.arg_names: arg_idx = scope.arg_names.index(arg_name) arg_usage = self.arg_usage_info.setdefault( (scope.subprogram_name, arg_idx), []) if isinstance(arg_usage, list): arg_usage.append((node.designator, i)) def map_Return(self, node): pass def map_ArithmeticIf(self, node): pass def map_If(self, node): for c in node.content: self.rec(c) def map_IfThen(self, node): for c in node.content: self.rec(c) def map_ElseIf(self, node): pass def map_Else(self, node): pass def map_EndIfThen(self, node): pass def map_Do(self, node): for c in node.content: self.rec(c) def map_EndDo(self, node): pass def map_Continue(self, node): pass def map_Stop(self, node): pass def map_Comment(self, node): pass # }}} # }}} # {{{ translator 
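# Second pass over the Fortran AST: emits a cgen tree for the OpenCL-like C
# output. Whether a given argument is passed as a pointer in the generated C
# is looked up in the ArgumentAnalayzer results supplied as arg_info.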
class F2CLTranslator(FTreeWalkerBase): def __init__(self, addr_space_hints, force_casts, arg_info, use_restrict_pointers): FTreeWalkerBase.__init__(self) self.addr_space_hints = addr_space_hints self.force_casts = force_casts self.arg_info = arg_info self.use_restrict_pointers = use_restrict_pointers def arg_needs_pointer(self, subprogram_name, arg_index): return self.arg_info.arg_needs_pointer(subprogram_name, arg_index) # {{{ declaration helpers def get_declarator(self, name): scope = self.scope_stack[-1] return POD(scope.get_type(name), name) def get_declarations(self): scope = self.scope_stack[-1] result = [] pre_func_decl = [] def gen_shape(start_end): return ":".join(self.gen_expr(s) for s in start_end) for name in sorted(scope.known_names()): shape = scope.dim_map.get(name) if shape is not None: dim_stmt = cgen.Statement( 'dimension "fortran" {}[{}]'.format( scope.translate_var_name(name), ", ".join(gen_shape(s) for s in shape) )) # cannot omit "dimension" decl even for rank-1 args: result.append(dim_stmt) if name in scope.data: assert name not in scope.arg_names data = scope.data[name] if shape is None: assert len(data) == 1 result.append( cgen.Initializer( self.get_declarator(name), self.gen_expr(data[0]) )) else: from cgen.opencl import CLConstant pre_func_decl.append( cgen.Initializer( CLConstant( cgen.ArrayOf(self.get_declarator( f"{scope.subprogram_name}_{name}"))), "{ %s }" % ",\n".join(self.gen_expr(x) for x in data) )) else: if name not in scope.arg_names: if shape is not None: result.append(cgen.Statement( "%s %s[nitemsof(%s)]" % ( dtype_to_ctype(scope.get_type(name)), name, name))) else: result.append(self.get_declarator(name)) return pre_func_decl, result def map_statement_list(self, content): body = [] for c in content: mapped = self.rec(c) if mapped is None: warn("mapping '%s' returned None" % type(c)) elif isinstance(mapped, list): body.extend(mapped) else: body.append(mapped) return body # }}} # {{{ map_XXX functions def map_BeginSource(self, node): scope = Scope(None) self.scope_stack.append(scope) return self.map_statement_list(node.content) def map_Subroutine(self, node): assert not node.prefix assert not hasattr(node, "suffix") scope = Scope(node.name, list(node.args)) self.scope_stack.append(scope) body = self.map_statement_list(node.content) pre_func_decl, in_func_decl = self.get_declarations() body = [*in_func_decl, cgen.Line(), *body] if isinstance(body[-1], cgen.Statement) and body[-1].text == "return": body.pop() def get_arg_decl(arg_idx, arg_name): decl = self.get_declarator(arg_name) if self.arg_needs_pointer(node.name, arg_idx): hint = self.addr_space_hints.get((node.name, arg_name)) if hint: decl = hint(cgen.Pointer(decl)) else: if self.use_restrict_pointers: decl = cgen.RestrictPointer(decl) else: decl = cgen.Pointer(decl) return decl result = cgen.FunctionBody( cgen.FunctionDeclaration( cgen.Value("void", node.name), [get_arg_decl(i, arg) for i, arg in enumerate(node.args)] ), cgen.Block(body)) self.scope_stack.pop() if pre_func_decl: return [*pre_func_decl, cgen.Line(), result] else: return result def map_EndSubroutine(self, node): return [] def map_Implicit(self, node): scope = self.scope_stack[-1] if not node.items: assert not scope.implicit_types scope.implicit_types = None for stmt, specs in node.items: tp = self.dtype_from_stmt(stmt) for start, end in specs: for char_code in range(ord(start), ord(end)+1): scope.implicit_types[chr(char_code)] = tp return [] # {{{ types, declarations def map_Equivalence(self, node): raise 
NotImplementedError("equivalence") TYPE_MAP: ClassVar[Dict[Tuple[str, str], np.generic]] = { ("real", "4"): np.float32, ("real", "8"): np.float64, ("real", "16"): np.float128, ("complex", "8"): np.complex64, ("complex", "16"): np.complex128, ("complex", "32"): np.complex256, ("integer", ""): np.int32, ("integer", "4"): np.int32, ("integer", "8"): np.int64, } def dtype_from_stmt(self, stmt): length, kind = stmt.selector assert not kind return np.dtype(self.TYPE_MAP[type(stmt).__name__.lower(), length]) def map_type_decl(self, node): scope = self.scope_stack[-1] tp = self.dtype_from_stmt(node) for name, shape in self.parse_dimension_specs(node.entity_decls): if shape is not None: assert name not in scope.dim_map scope.dim_map[name] = shape scope.use_name(name) assert name not in scope.type_map scope.type_map[name] = tp return [] map_Logical = map_type_decl map_Integer = map_type_decl map_Real = map_type_decl map_Complex = map_type_decl def map_Dimension(self, node): scope = self.scope_stack[-1] for name, shape in self.parse_dimension_specs(node.items): if shape is not None: assert name not in scope.dim_map scope.dim_map[name] = shape scope.use_name(name) return [] def map_External(self, node): raise NotImplementedError("external") # }}} def map_Data(self, node): scope = self.scope_stack[-1] for name, data in node.stmts: name, = name assert name not in scope.data scope.data[name] = [self.parse_expr(i) for i in data] return [] def map_Parameter(self, node): raise NotImplementedError("parameter") # {{{ I/O def map_Open(self, node): raise NotImplementedError def map_Format(self, node): warn("'format' unsupported", TranslatorWarning) def map_Write(self, node): warn("'write' unsupported", TranslatorWarning) def map_Print(self, node): warn("'print' unsupported", TranslatorWarning) def map_Read1(self, node): warn("'read' unsupported", TranslatorWarning) # }}} def map_Assignment(self, node): lhs = self.parse_expr(node.variable) from pymbolic.primitives import Subscript if isinstance(lhs, Subscript): lhs_name = lhs.aggregate.name else: lhs_name = lhs.name scope = self.scope_stack[-1] scope.use_name(lhs_name) infer_type = scope.get_type_inference_mapper() rhs = self.parse_expr(node.expr) lhs_dtype = infer_type(lhs) rhs_dtype = infer_type(rhs) # check for silent truncation of complex if lhs_dtype.kind != "c" and rhs_dtype.kind == "c": from pymbolic import var rhs = var("real")(rhs) # check for silent widening of real if lhs_dtype.kind == "c" and rhs_dtype.kind != "c": from pymbolic import var rhs = var("fromreal")(rhs) return cgen.Assign(self.gen_expr(lhs), self.gen_expr(rhs)) def map_Allocate(self, node): raise NotImplementedError("allocate") def map_Deallocate(self, node): raise NotImplementedError("deallocate") def map_Save(self, node): raise NotImplementedError("save") def map_Line(self, node): # from warnings import warn # warn("Encountered a 'line': %s" % node) raise NotImplementedError def map_Program(self, node): raise NotImplementedError def map_Entry(self, node): raise NotImplementedError # {{{ control flow def map_Goto(self, node): return cgen.Statement("goto label_%s" % node.label) def map_Call(self, node): def transform_arg(i, arg_str): expr = self.parse_expr(arg_str) result = self.gen_expr(expr) if self.arg_needs_pointer(node.designator, i): result = "&"+result cast = self.force_casts.get( (node.designator, i)) if cast is not None: result = f"({cast}) ({result})" return result return cgen.Statement("{}({})".format( node.designator, ", ".join(transform_arg(i, arg_str) for i, arg_str in 
enumerate(node.items)))) def map_Return(self, node): return cgen.Statement("return") def map_ArithmeticIf(self, node): raise NotImplementedError def map_If(self, node): return cgen.If(self.transform_expr(node.expr), self.rec(node.content[0])) def map_IfThen(self, node): current_cond = self.transform_expr(node.expr) blocks_and_conds = [] else_block = [] def end_block(): if current_body: if current_cond is None: else_block[:] = self.map_statement_list(current_body) else: blocks_and_conds.append( (current_cond, cgen.block_if_necessary( self.map_statement_list(current_body)))) del current_body[:] from fparser.statements import Else, ElseIf i = 0 current_body = [] while i < len(node.content): c = node.content[i] if isinstance(c, ElseIf): end_block() current_cond = self.transform_expr(c.expr) elif isinstance(c, Else): end_block() current_cond = None else: current_body.append(c) i += 1 end_block() def block_or_none(body): if not body: return None else: return cgen.block_if_necessary(body) return cgen.make_multiple_ifs( blocks_and_conds, block_or_none(else_block)) def map_EndIfThen(self, node): return [] def map_Do(self, node): scope = self.scope_stack[-1] body = self.map_statement_list(node.content) if node.loopcontrol: loop_var, loop_bounds = node.loopcontrol.split("=") loop_var = loop_var.strip() scope.use_name(loop_var) loop_bounds = [self.parse_expr(s) for s in loop_bounds.split(",")] if len(loop_bounds) == 2: start, stop = loop_bounds step = 1 elif len(loop_bounds) == 3: start, stop, step = loop_bounds else: raise RuntimeError("loop bounds not understood: %s" % node.loopcontrol) if not isinstance(step, int): print(type(step)) raise TranslationError( "non-constant steps not yet supported: %s" % step) if step < 0: comp_op = ">=" else: comp_op = "<=" return cgen.For( "{} = {}".format(loop_var, self.gen_expr(start)), "{} {} {}".format(loop_var, comp_op, self.gen_expr(stop)), "{} += {}".format(loop_var, self.gen_expr(step)), cgen.block_if_necessary(body)) else: raise NotImplementedError("unbounded do loop") def map_EndDo(self, node): return [] def map_Continue(self, node): return cgen.Statement("label_%s:" % node.label) def map_Stop(self, node): raise NotImplementedError("stop") def map_Comment(self, node): if node.content: return cgen.LineComment(node.content.strip()) else: return [] # }}} # }}} # {{{ expressions def gen_expr(self, expr): scope = self.scope_stack[-1] return CCodeMapper(self, scope)(expr) def transform_expr(self, expr_str): return self.gen_expr(self.expr_parser(expr_str)) # }}} # }}} def f2cl(source, free_form=False, strict=True, addr_space_hints={}, force_casts={}, do_arg_analysis=True, use_restrict_pointers=False, try_compile=False): from fparser import api tree = api.parse(source, isfree=free_form, isstrict=strict, analyze=False, ignore_comments=False) arg_info = ArgumentAnalayzer() if do_arg_analysis: arg_info(tree) source = F2CLTranslator(addr_space_hints, force_casts, arg_info, use_restrict_pointers=use_restrict_pointers)(tree) func_decls = [] for entry in source: if isinstance(entry, cgen.FunctionBody): func_decls.append(entry.fdecl) mod = cgen.Module([*func_decls, cgen.Line(), *source]) # open("pre-cnd.cl", "w").write(str(mod)) from cnd import transform_cl str_mod = transform_cl(str(mod)) if try_compile: import pyopencl as cl ctx = cl.create_some_context() cl.Program(ctx, """ #if __OPENCL_VERSION__ <= CL_VERSION_1_1 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #include """).build() return str_mod def f2cl_files(source_file, target_file, **kwargs): mod = 
f2cl(open(source_file).read(), **kwargs) open(target_file, "w").write(mod) if __name__ == "__main__": import logging console = logging.StreamHandler() console.setLevel(logging.DEBUG) formatter = logging.Formatter("%(name)-12s: %(levelname)-8s %(message)s") console.setFormatter(formatter) logging.getLogger("fparser").addHandler(console) from cgen.opencl import CLConstant if 0: f2cl_files("hank107.f", "hank107.cl", addr_space_hints={ ("hank107p", "p"): CLConstant, ("hank107pc", "p"): CLConstant, }, force_casts={ ("hank107p", 0): "__constant cdouble_t *", }) f2cl_files("cdjseval2d.f", "cdjseval2d.cl") f2cl_files("hank103.f", "hank103.cl", addr_space_hints={ ("hank103p", "p"): CLConstant, ("hank103pc", "p"): CLConstant, }, force_casts={ ("hank103p", 0): "__constant cdouble_t *", }, try_compile=True) # vim: foldmethod=marker pyopencl-2025.1/contrib/pyopencl.vim0000644000000000000000000000451314332717401014375 0ustar00" Vim highlighting for PyOpenCL " ----------------------------- " " (C) Andreas Kloeckner 2011, MIT license " " Uses parts of mako.vim by Armin Ronacher. " " Installation: " Just drop this file into ~/.vim/syntax/pyopencl.vim " " Then do " :set filetype=pyopencl " and use " """//CL// ...code...""" " for OpenCL code included in your Python file. " " You may also include a line " vim: filetype=pyopencl.python " at the end of your file to set the file type automatically. " " Optional: Install opencl.vim from " http://www.vim.org/scripts/script.php?script_id=3157 runtime! syntax/python.vim unlet b:current_syntax try syntax include @clCode syntax/opencl.vim catch syntax include @clCode syntax/c.vim endtry unlet b:current_syntax syn include @pythonTop syntax/python.vim " {{{ mako syn region clmakoLine start="^\s*%" skip="\\$" end="$" syn region clmakoVariable start=#\${# end=#}# contains=@pythonTop syn region clmakoBlock start=#<%!# end=#%># keepend contains=@pythonTop syn match clmakoAttributeKey containedin=clmakoTag contained "[a-zA-Z_][a-zA-Z0-9_]*=" syn region clmakoAttributeValue containedin=clmakoTag contained start=/"/ skip=/\\"/ end=/"/ syn region clmakoAttributeValue containedin=clmakoTag contained start=/'/ skip=/\\'/ end=/'/ syn region clmakoTag start="" end="/\?>" " The C highlighter's paren error detection screws up highlighting of " Mako variables in C parens--turn it off. syn clear cParen syn clear cParenError if !exists("c_no_bracket_error") syn clear cBracket endif syn cluster clmakoCode contains=clmakoLine,clmakoVariable,clmakoBlock,clmakoTag hi link clmakoLine Preproc hi link clmakoVariable Preproc hi link clmakoBlock Preproc hi link clmakoTag Define hi link clmakoAttributeKey String hi link clmakoAttributeValue String " }}} syn region pythonCLString \ start=+[uU]\=\z('''\|"""\)//CL\(:[a-zA-Z_0-9]\+\)\?//+ end="\z1" keepend \ contains=@clCode,@clmakoCode syn region pythonCLRawString \ start=+[uU]\=[rR]\z('''\|"""\)//CL\(:[a-zA-Z_0-9]\+\)\?//+ end="\z1" keepend \ contains=@clCode,@clmakoCode " Uncomment if you still want the code highlighted as a string. " hi link pythonCLString String " hi link pythonCLRawString String syntax sync fromstart let b:current_syntax = "pyopencl" " vim: foldmethod=marker pyopencl-2025.1/doc/.gitignore0000644000000000000000000000001614332717401013116 0ustar00constants.inc pyopencl-2025.1/doc/Makefile0000644000000000000000000000132614332717401012573 0ustar00# Minimal makefile for Sphinx documentation # # You can set these variables from the command line, and also # from the environment for the first two. 
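# For example, to override SPHINXOPTS for a single run (an illustrative
# one-off override; any Sphinx options work here):
#   make html SPHINXOPTS="-W -n -q"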
SPHINXOPTS ?= -W -n SPHINXBUILD ?= python $(shell which sphinx-build) SOURCEDIR = . BUILDDIR = _build # Put it first so that "make" without argument is like "make help". help: @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) constants: python make_constants.py > constants.inc .PHONY: help Makefile # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). %: Makefile constants @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) pyopencl-2025.1/doc/algorithm.rst0000644000000000000000000001533614332717401013661 0ustar00Parallel Algorithms =================== .. include:: subst.rst Element-wise expression evaluation ("map") ------------------------------------------ .. module:: pyopencl.elementwise Evaluating involved expressions on :class:`pyopencl.array.Array` instances by using overloaded operators can be somewhat inefficient, because a new temporary is created for each intermediate result. The functionality in the module :mod:`pyopencl.elementwise` contains tools to help generate kernels that evaluate multi-stage expressions on one or several operands in a single pass. .. autoclass:: ElementwiseKernel Here's a usage example: .. literalinclude:: ../examples/demo_elementwise.py (You can find this example as :download:`examples/demo_elementwise.py <../examples/demo_elementwise.py>` in the PyOpenCL distribution.) .. _custom-reductions: Sums and counts ("reduce") -------------------------- .. module:: pyopencl.reduction .. autoclass:: ReductionKernel Here's a usage example:: a = pyopencl.array.arange(queue, 400, dtype=numpy.float32) b = pyopencl.array.arange(queue, 400, dtype=numpy.float32) krnl = ReductionKernel(ctx, numpy.float32, neutral="0", reduce_expr="a+b", map_expr="x[i]*y[i]", arguments="__global float *x, __global float *y") my_dot_prod = krnl(a, b).get() .. _custom-scan: Prefix Sums ("scan") -------------------- .. module:: pyopencl.scan .. |scan_extra_args| replace:: a list of tuples *(name, value)* specifying extra arguments to pass to the scan procedure. For version 2013.1, *value* must be a of a :mod:`numpy` sized scalar type. As of version 2013.2, *value* may also be a :class:`pyopencl.array.Array`. .. |preamble| replace:: A snippet of C that is inserted into the compiled kernel before the actual kernel function. May be used for, e.g. type definitions or include statements. A prefix sum is a running sum of an array, as provided by e.g. :func:`numpy.cumsum`:: >>> import numpy as np >>> a = [1,1,1,1,1,2,2,2,2,2] >>> np.cumsum(a) array([ 1, 2, 3, 4, 5, 7, 9, 11, 13, 15]) This is a very simple example of what a scan can do. It turns out that scans are significantly more versatile. They are a basic building block of many non-trivial parallel algorithms. Many of the operations enabled by scans seem difficult to parallelize because of loop-carried dependencies. .. seealso:: `Prefix sums and their applications `__, by Guy Blelloch. This article gives an overview of some surprising applications of scans. :ref:`predefined-scans` These operations built into PyOpenCL are realized using :class:`GenericScanKernel`. Usage Example ^^^^^^^^^^^^^ This example illustrates the implementation of a simplified version of :func:`pyopencl.algorithm.copy_if`, which copies integers from an array into the (variable-size) output if they are greater than 300:: knl = GenericScanKernel( ctx, np.int32, arguments="__global int *ary, __global int *out", input_expr="(ary[i] > 300) ? 
1 : 0", scan_expr="a+b", neutral="0", output_statement=""" if (prev_item != item) out[item-1] = ary[i]; """) out = a.copy() knl(a, out) a_host = a.get() out_host = a_host[a_host > 300] assert (out_host == out.get()[:len(out_host)]).all() The value being scanned over is a number of flags indicating whether each array element is greater than 300. These flags are computed by *input_expr*. The prefix sum over this array gives a running count of array items greater than 300. The *output_statement* the compares ``prev_item`` (the previous item's scan result, i.e. index) to ``item`` (the current item's scan result, i.e. index). If they differ, i.e. if the predicate was satisfied at this position, then the item is stored in the output at the computed index. This example does not make use of the following advanced features also available in PyOpenCL: * Segmented scans * Access to the previous item in *input_expr* (e.g. for comparisons) See the `implementation `__ of :func:`pyopencl.algorithm.unique` for an example. Making Custom Scan Kernels ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. versionadded:: 2013.1 .. autoclass:: GenericScanKernel Debugging aids ~~~~~~~~~~~~~~ .. autoclass:: GenericDebugScanKernel .. _predefined-scans: Simple / Legacy Interface ^^^^^^^^^^^^^^^^^^^^^^^^^ .. class:: ExclusiveScanKernel(ctx, dtype, scan_expr, neutral, name_prefix="scan", options=[], preamble="", devices=None) Generates a kernel that can compute a `prefix sum `__ using any associative operation given as *scan_expr*. *scan_expr* uses the formal values "a" and "b" to indicate two operands of an associative binary operation. *neutral* is the neutral element of *scan_expr*, obeying *scan_expr(a, neutral) == a*. *dtype* specifies the type of the arrays being operated on. *name_prefix* is used for kernel names to ensure recognizability in profiles and logs. *options* is a list of compiler options to use when building. *preamble* specifies a string of code that is inserted before the actual kernels. *devices* may be used to restrict the set of devices on which the kernel is meant to run. (defaults to all devices in the context *ctx*. .. method:: __call__(self, input_ary, output_ary=None, allocator=None, queue=None) .. class:: InclusiveScanKernel(ctx, dtype, scan_expr, neutral=None, name_prefix="scan", options=[], preamble="", devices=None) Works like :class:`ExclusiveScanKernel`. .. versionchanged:: 2013.1 *neutral* is now always required. For the array ``[1, 2, 3]``, inclusive scan results in ``[1, 3, 6]``, and exclusive scan results in ``[0, 1, 3]``. Here's a usage example:: knl = InclusiveScanKernel(context, np.int32, "a+b") n = 2**20-2**18+5 rng = np.random.default_rng(seed=42) host_data = rng.integers(0, 10, size=n, dtype=np.int32) dev_data = cl_array.to_device(queue, host_data) knl(dev_data) assert (dev_data.get() == np.cumsum(host_data, axis=0)).all() Predicated copies ("partition", "unique", ...) ---------------------------------------------- .. module:: pyopencl.algorithm .. autofunction:: copy_if .. autofunction:: remove_if .. autofunction:: partition .. autofunction:: unique Sorting (radix sort) -------------------- .. autoclass:: RadixSort .. automethod:: __call__ Building many variable-size lists --------------------------------- .. autoclass:: ListOfListsBuilder Bitonic Sort ------------ .. module:: pyopencl.bitonic_sort .. autoclass:: BitonicSort pyopencl-2025.1/doc/array.rst0000644000000000000000000002043114332717401013001 0ustar00Multi-dimensional arrays ======================== .. 
module:: pyopencl.array The functionality in this module provides something of a work-alike for :mod:`numpy` arrays, but with all operations executed on the CL compute device. Data Types ---------- PyOpenCL provides some amount of integration between the :mod:`numpy` type system, as represented by :class:`numpy.dtype`, and the types available in OpenCL. All the simple scalar types map straightforwardly to their CL counterparts. .. _vector-types: Vector Types ^^^^^^^^^^^^ .. class :: vec All of OpenCL's supported vector types, such as ``float3`` and ``long4`` are available as :mod:`numpy` data types within this class. These :class:`numpy.dtype` instances have field names of ``x``, ``y``, ``z``, and ``w`` just like their OpenCL counterparts. They will work both for parameter passing to kernels as well as for passing data back and forth between kernels and Python code. For each type, a ``make_type`` function is also provided (e.g. ``make_float3(x, y, z)``). If you want to construct a pre-initialized vector type you have three new functions to choose from: * ``zeros_type()`` * ``ones_type()`` * ``filled_type(fill_value)`` .. versionadded:: 2014.1 .. versionchanged:: 2014.1 The ``make_type`` functions have a default value (0) for each component. Relying on the default values has been deprecated. Either specify all components or use one of th new flavors mentioned above for constructing a vector. Custom data types ^^^^^^^^^^^^^^^^^ If you would like to use your own (struct/union/whatever) data types in array operations where you supply operation source code, define those types in the *preamble* passed to :class:`pyopencl.elementwise.ElementwiseKernel`, :class:`pyopencl.reduction.ReductionKernel` (or similar), and let PyOpenCL know about them using this function: .. currentmodule:: pyopencl.tools .. autofunction:: get_or_register_dtype .. exception:: TypeNameNotKnown .. versionadded:: 2013.1 .. function:: register_dtype(dtype, name) .. versionchanged:: 2013.1 This function has been deprecated. It is recommended that you develop against the new interface, :func:`get_or_register_dtype`. .. function:: dtype_to_ctype(dtype) Returns a C name registered for *dtype*. .. versionadded: 2013.1 This function helps with producing C/OpenCL declarations for structured :class:`numpy.dtype` instances: .. autofunction:: match_dtype_to_c_struct A more complete example of how to use custom structured types can be found in :file:`examples/demo-struct-reduce.py` in the PyOpenCL distribution. .. currentmodule:: pyopencl.array Complex Numbers ^^^^^^^^^^^^^^^ PyOpenCL's :class:`Array` type supports complex numbers out of the box, by simply using the corresponding :mod:`numpy` types. If you would like to use this support in your own kernels, here's how to proceed: Since OpenCL 1.2 (and earlier) do not specify native complex number support, PyOpenCL works around that deficiency. By saying:: #include in your kernel, you get complex types ``cfloat_t`` and ``cdouble_t``, along with functions defined on them such as ``cfloat_mul(a, b)`` or ``cdouble_log(z)``. Elementwise kernels automatically include the header if your kernel has complex input or output. See the `source file `__ for a precise list of what's available. If you need double precision support, please:: #define PYOPENCL_DEFINE_CDOUBLE before including the header, as DP support apparently cannot be reliably autodetected. Under the hood, the complex types are struct types as defined in the header. 
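As an illustration, here is how a kernel might multiply two complex arrays elementwise using these helper functions (a sketch only: the kernel name is made up, and ``ctx`` is assumed to be an existing :class:`pyopencl.Context` with :mod:`pyopencl` imported as ``cl``)::

    prg = cl.Program(ctx, """
    #include <pyopencl-complex.h>

    __kernel void complex_mul(
        __global const cfloat_t *a,
        __global const cfloat_t *b,
        __global cfloat_t *out)
    {
      int i = get_global_id(0);
      /* use the provided helper; the * operator would not perform
         complex multiplication on these struct types */
      out[i] = cfloat_mul(a[i], b[i]);
    }
    """).build()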
Ideally, you should only access the structs through the provided functions, never directly. .. versionadded:: 2012.1 .. versionchanged:: 2015.2 **[INCOMPATIBLE]** Changed PyOpenCL's complex numbers from ``float2`` and ``double2`` OpenCL vector types to custom ``struct``. This was changed because it very easily introduced bugs where * complex*complex * real+complex *look* like they may do the right thing, but silently do the wrong thing. The :class:`Array` Class ------------------------ .. autoclass:: Array .. autoexception:: ArrayHasOffsetError Constructing :class:`Array` Instances ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. autofunction:: to_device .. function:: empty(queue, shape, dtype, order="C", allocator=None, data=None) A synonym for the :class:`Array` constructor. .. autofunction:: zeros .. autofunction:: empty_like .. autofunction:: zeros_like .. autofunction:: arange .. autofunction:: take .. autofunction:: concatenate .. autofunction:: stack Manipulating :class:`Array` instances ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. autofunction:: transpose .. autofunction:: reshape Conditionals ^^^^^^^^^^^^ .. autofunction:: if_positive .. autofunction:: maximum .. autofunction:: minimum Logical Operations ^^^^^^^^^^^^^^^^^^ .. autofunction:: logical_and .. autofunction:: logical_or .. autofunction:: logical_not .. _reductions: Reductions ^^^^^^^^^^ .. autofunction:: sum .. autofunction:: all .. autofunction:: any .. autofunction:: dot .. autofunction:: vdot .. autofunction:: subset_dot .. autofunction:: max .. autofunction:: min .. autofunction:: subset_max .. autofunction:: subset_min See also :ref:`custom-reductions`. Elementwise Functions on :class:`Array` Instances ------------------------------------------------- .. module:: pyopencl.clmath The :mod:`pyopencl.clmath` module contains exposes array versions of the C functions available in the OpenCL standard. (See table 6.8 in the spec.) .. function:: acos(array, queue=None) .. function:: acosh(array, queue=None) .. function:: acospi(array, queue=None) .. function:: asin(array, queue=None) .. function:: asinh(array, queue=None) .. function:: asinpi(array, queue=None) .. function:: atan(array, queue=None) .. autofunction:: atan2 .. function:: atanh(array, queue=None) .. function:: atanpi(array, queue=None) .. autofunction:: atan2pi .. function:: cbrt(array, queue=None) .. function:: ceil(array, queue=None) .. TODO: copysign .. function:: cos(array, queue=None) .. function:: cosh(array, queue=None) .. function:: cospi(array, queue=None) .. function:: erfc(array, queue=None) .. function:: erf(array, queue=None) .. function:: exp(array, queue=None) .. function:: exp2(array, queue=None) .. function:: exp10(array, queue=None) .. function:: expm1(array, queue=None) .. function:: fabs(array, queue=None) .. TODO: fdim .. function:: floor(array, queue=None) .. TODO: fma .. TODO: fmax .. TODO: fmin .. function:: fmod(arg, mod, queue=None) Return the floating point remainder of the division ``arg / mod``, for each element in ``arg`` and ``mod``. .. TODO: fract .. function:: frexp(arg, queue=None) Return a tuple ``(significands, exponents)`` such that ``arg == significand * 2**exponent``. .. TODO: hypot .. function:: ilogb(array, queue=None) .. function:: ldexp(significand, exponent, queue=None) Return a new array of floating point values composed from the entries of ``significand`` and ``exponent``, paired together as ``result = significand * 2**exponent``. .. function:: lgamma(array, queue=None) .. TODO: lgamma_r .. 
function:: log(array, queue=None) .. function:: log2(array, queue=None) .. function:: log10(array, queue=None) .. function:: log1p(array, queue=None) .. function:: logb(array, queue=None) .. TODO: mad .. TODO: maxmag .. TODO: minmag .. function:: modf(arg, queue=None) Return a :class:`tuple` ``(fracpart, intpart)`` of arrays containing the integer and fractional parts of ``arg``. .. function:: nan(array, queue=None) .. TODO: nextafter .. TODO: remainder .. TODO: remquo .. function:: rint(array, queue=None) .. TODO: rootn .. function:: round(array, queue=None) .. function:: sin(array, queue=None) .. TODO: sincos .. function:: sinh(array, queue=None) .. function:: sinpi(array, queue=None) .. function:: sqrt(array, queue=None) .. function:: tan(array, queue=None) .. function:: tanh(array, queue=None) .. function:: tanpi(array, queue=None) .. function:: tgamma(array, queue=None) .. function:: trunc(array, queue=None) Generating Arrays of Random Numbers ----------------------------------- .. automodule:: pyopencl.clrandom pyopencl-2025.1/doc/conf.py0000644000000000000000000000153514332717401012434 0ustar00from urllib.request import urlopen _conf_url = \ "https://raw.githubusercontent.com/inducer/sphinxconfig/main/sphinxconfig.py" with urlopen(_conf_url) as _inf: exec(compile(_inf.read(), _conf_url, "exec"), globals()) exclude_patterns = ["subst.rst"] copyright = "2009-21, Andreas Kloeckner" ver_dic = {} with open("../pyopencl/version.py") as ver_file: ver_src = ver_file.read() exec(compile(ver_src, "../pyopencl/version.py", "exec"), ver_dic) version = ".".join(str(x) for x in ver_dic["VERSION"]) # The full version, including alpha/beta/rc tags. release = ver_dic["VERSION_TEXT"] intersphinx_mapping = { "python": ("https://docs.python.org/3", None), "numpy": ("https://numpy.org/doc/stable/", None), "mako": ("https://docs.makotemplates.org/en/latest", None), "pytools": ("https://documen.tician.de/pytools", None), } pyopencl-2025.1/doc/howto.rst0000644000000000000000000000655014332717401013031 0ustar00How-tos ======= How to use struct types with PyOpenCL ------------------------------------- We import and initialize PyOpenCL as usual: .. doctest:: :options: +ELLIPSIS >>> import numpy as np >>> import pyopencl as cl >>> import pyopencl.tools >>> import pyopencl.array >>> ctx = cl.create_some_context(interactive=False) >>> queue = cl.CommandQueue(ctx) Then, suppose we would like to declare a struct consisting of an integer and a floating point number. We first create a :class:`numpy.dtype` along these lines: .. doctest:: >>> my_struct = np.dtype([("field1", np.int32), ("field2", np.float32)]) >>> print(my_struct) [('field1', '`__. So as a first step, we match our dtype against CL's version: .. doctest:: >>> my_struct, my_struct_c_decl = cl.tools.match_dtype_to_c_struct( ... ctx.devices[0], "my_struct", my_struct) >>> print(my_struct_c_decl) typedef struct { int field1; float field2; } my_struct; We then tell PyOpenCL about our new type. .. doctest:: >>> my_struct = cl.tools.get_or_register_dtype("my_struct", my_struct) Next, we can create some data of that type on the host and transfer it to the device: .. doctest:: >>> ary_host = np.empty(20, my_struct) >>> ary_host["field1"].fill(217) >>> ary_host["field2"].fill(1000) >>> ary_host[13]["field2"] = 12 >>> print(ary_host) #doctest: +NORMALIZE_WHITESPACE [(217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 12.) (217, 1000.) 
(217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.) (217, 1000.)] >>> ary = cl.array.to_device(queue, ary_host) We can then operate on the array with our own kernels: .. doctest:: >>> prg = cl.Program(ctx, my_struct_c_decl + """ ... __kernel void set_to_1(__global my_struct *a) ... { ... a[get_global_id(0)].field1 = 1; ... } ... """).build() >>> evt = prg.set_to_1(queue, ary.shape, None, ary.data) >>> print(ary) #doctest: +NORMALIZE_WHITESPACE [(1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 12.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.) (1, 1000.)] as well as with PyOpenCL's built-in operations: .. doctest:: >>> from pyopencl.elementwise import ElementwiseKernel >>> elwise = ElementwiseKernel(ctx, "my_struct *a", "a[i].field1 = 2;", ... preamble=my_struct_c_decl) >>> evt = elwise(ary) >>> print(ary) #doctest: +NORMALIZE_WHITESPACE [(2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 12.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.) (2, 1000.)] pyopencl-2025.1/doc/index.rst0000644000000000000000000001253114332717401012774 0ustar00Welcome to PyOpenCL's documentation! ==================================== PyOpenCL gives you easy, Pythonic access to the `OpenCL `__ parallel computation API. What makes PyOpenCL special? * Object cleanup tied to lifetime of objects. This idiom, often called `RAII `__ in C++, makes it much easier to write correct, leak- and crash-free code. * Completeness. PyOpenCL puts the full power of OpenCL's API at your disposal, if you wish. Every obscure ``get_info()`` query and all CL calls are accessible. * Automatic Error Checking. All errors are automatically translated into Python exceptions. * Speed. PyOpenCL's base layer is written in C++, so all the niceties above are virtually free. * Helpful Documentation. You're looking at it. ;) * Liberal license. PyOpenCL is open-source under the :ref:`MIT license ` and free for commercial, academic, and private use. Here's an example, to give you an impression: .. literalinclude:: ../examples/demo.py (You can find this example as :download:`examples/demo.py <../examples/demo.py>` in the PyOpenCL source distribution.) Tutorials ========= * Gaston Hillar's `two-part article series `__ in Dr. Dobb's Journal provides a friendly introduction to PyOpenCL. * `Simon McIntosh-Smith `__ and `Tom Deakin `__'s course `Hands-on OpenCL `__ contains both `lecture slides `__ and `exercises (with solutions) `__ (The course covers PyOpenCL as well as OpenCL's C and C++ APIs.) * PyOpenCL course at `PASI `__: Parts `1 `__ `2 `__ `3 `__ `4 `__ (YouTube, 2011) * PyOpenCL course at `DTU GPULab `__ and `Simula `__ (2011): `Lecture 1 `__ `Lecture 2 `__ `Problem set 1 `__ `Problem set 2 `__ * Ian Johnson's `PyOpenCL tutorial `__. Software that works with or enhances PyOpenCL ============================================= * Jon Roose's `pyclblas `__ (`code `__) makes BLAS in the form of `clBLAS `__ available from within :mod:`pyopencl` code. Two earlier wrappers continue to be available: one by `Eric Hunsberger `__ and one by `Lars Ericson `__. * Cedric Nugteren provides a wrapper for the `CLBlast `__ OpenCL BLAS library: `PyCLBlast `__. * Gregor Thalhammer's `gpyfft `__ provides a Python wrapper for the OpenCL FFT library clFFT from AMD. 
* Bogdan Opanchuk's `reikna `__ offers a variety of GPU-based algorithms (FFT, random number generation, matrix multiplication) designed to work with :class:`pyopencl.array.Array` objects. * Troels Henriksen, Ken Friis Larsen, and Cosmin Oancea's `Futhark `__ programming language offers a nice way to code nested-parallel programs with reductions and scans on data in :class:`pyopencl.array.Array` instances. * Robbert Harms and Alard Roebroeck's `MOT `__ offers a variety of GPU-enabled non-linear optimization algorithms and MCMC sampling routines for parallel optimization and sampling of multiple problems. * Vincent Favre-Nicolin's `pyvkfft `__ makes `vkfft `__ accessible from PyOpenCL. If you know of a piece of software you feel that should be on this list, please let me know, or, even better, send a patch! Contents ======== .. toctree:: :maxdepth: 2 runtime runtime_const runtime_platform runtime_queue runtime_memory runtime_program runtime_gl tools array types algorithm howto misc 🚀 Github 💾 Download Releases Note that this guide does not explain OpenCL programming and technology. Please refer to the official `Khronos OpenCL documentation `__ for that. PyOpenCL also has its own `web site `__, where you can find updates, new versions, documentation, and support. Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` pyopencl-2025.1/doc/make_constants.py0000644000000000000000000004261514332717401014524 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import pyopencl as cl fission = ("cl_ext_device_fission", "2011.1") nv_devattr = ("cl_nv_device_attribute_query", "0.92") gl_sharing = ("cl_khr_gl_sharing", "0.92") cl_spir_devattr = ("cl_khr_spir", "2016.2") cl_11 = ("CL_1.1", "0.92") cl_12 = ("CL_1.2", "2011.2") cl_12_2015 = ("CL_1.2", "2015.2") cl_20 = ("CL_2.0", "2015.2") cl_21_late = ("CL_2.1", "2020.3") cl_21 = ("CL_2.1", "2016.2") cl_22 = ("CL_2.1", "2020.3") cl_30 = ("CL_3.0", "2020.3") amd_devattr = ("cl_amd_device_attribute_query", "2013.2") qcom_hp_devattr = ("cl_qcom_ext_host_ptr", "2016.2") intel_me_devattr = ("cl_intel_advanced_motion_estimation", "2016.2") intel_ss_devattr = ("cl_intel_simultaneous_sharing", "2016.2") altera_temp_devattr = ("cl_altera_device_temperature", "2016.2") def get_extra_lines(tup): ext_name, pyopencl_ver = tup if ext_name is not None: if ext_name.startswith("CL_"): # capital letters -> CL version, not extension yield "" yield " Available with OpenCL %s." 
% ( ext_name[3:]) yield "" else: yield "" yield " Available with the ``%s`` extension." % ext_name yield "" if pyopencl_ver is not None: yield "" yield " .. versionadded:: %s" % pyopencl_ver yield "" const_ext_lookup = { cl.status_code: { "PLATFORM_NOT_FOUND_KHR": ("cl_khr_icd", "2011.1"), "INVALID_GL_SHAREGROUP_REFERENCE_KHR": gl_sharing, "MISALIGNED_SUB_BUFFER_OFFSET": cl_11, "EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST": cl_11, "INVALID_GLOBAL_WORK_SIZE": cl_11, "COMPILE_PROGRAM_FAILURE": cl_12, "LINKER_NOT_AVAILABLE": cl_12, "LINK_PROGRAM_FAILURE": cl_12, "DEVICE_PARTITION_FAILED": cl_12, "KERNEL_ARG_INFO_NOT_AVAILABLE": cl_12, "INVALID_IMAGE_DESCRIPTOR": cl_12, "INVALID_COMPILER_OPTIONS": cl_12, "INVALID_LINKER_OPTIONS": cl_12, "INVALID_DEVICE_PARTITION_COUNT": cl_12, "INVALID_PIPE_SIZE": cl_20, "INVALID_DEVICE_QUEUE": cl_20, "INVALID_SPEC_ID": cl_22, "MAX_SIZE_RESTRICTION_EXCEEDED": cl_22, }, cl.device_info: { "PREFERRED_VECTOR_WIDTH_HALF": cl_11, "HOST_UNIFIED_MEMORY": cl_11, "NATIVE_VECTOR_WIDTH_CHAR": cl_11, "NATIVE_VECTOR_WIDTH_SHORT": cl_11, "NATIVE_VECTOR_WIDTH_INT": cl_11, "NATIVE_VECTOR_WIDTH_LONG": cl_11, "NATIVE_VECTOR_WIDTH_FLOAT": cl_11, "NATIVE_VECTOR_WIDTH_DOUBLE": cl_11, "NATIVE_VECTOR_WIDTH_HALF": cl_11, "OPENCL_C_VERSION": cl_11, "SPIR_VERSIONS": cl_spir_devattr, "COMPUTE_CAPABILITY_MAJOR_NV": nv_devattr, "COMPUTE_CAPABILITY_MINOR_NV": nv_devattr, "REGISTERS_PER_BLOCK_NV": nv_devattr, "WARP_SIZE_NV": nv_devattr, "GPU_OVERLAP_NV": nv_devattr, "KERNEL_EXEC_TIMEOUT_NV": nv_devattr, "INTEGRATED_MEMORY_NV": nv_devattr, "ATTRIBUTE_ASYNC_ENGINE_COUNT_NV": nv_devattr, "PCI_BUS_ID_NV": nv_devattr, "PCI_SLOT_ID_NV": nv_devattr, "PCI_DOMAIN_ID_NV": nv_devattr, "DOUBLE_FP_CONFIG": ("cl_khr_fp64", "2011.1"), "HALF_FP_CONFIG": ("cl_khr_fp16", "2011.1"), "PROFILING_TIMER_OFFSET_AMD": amd_devattr, "TOPOLOGY_AMD": amd_devattr, "BOARD_NAME_AMD": amd_devattr, "GLOBAL_FREE_MEMORY_AMD": amd_devattr, "SIMD_PER_COMPUTE_UNIT_AMD": amd_devattr, "SIMD_WIDTH_AMD": amd_devattr, "SIMD_INSTRUCTION_WIDTH_AMD": amd_devattr, "WAVEFRONT_WIDTH_AMD": amd_devattr, "GLOBAL_MEM_CHANNELS_AMD": amd_devattr, "GLOBAL_MEM_CHANNEL_BANKS_AMD": amd_devattr, "GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD": amd_devattr, "LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD": amd_devattr, "LOCAL_MEM_BANKS_AMD": amd_devattr, "THREAD_TRACE_SUPPORTED_AMD": amd_devattr, "GFXIP_MAJOR_AMD": amd_devattr, "GFXIP_MINOR_AMD": amd_devattr, "AVAILABLE_ASYNC_QUEUES_AMD": amd_devattr, "ME_VERSION_INTEL": intel_me_devattr, "SIMULTANEOUS_INTEROPS_INTEL": intel_ss_devattr, "NUM_SIMULTANEOUS_INTEROPS_INTEL": intel_ss_devattr, "EXT_MEM_PADDING_IN_BYTES_QCOM": qcom_hp_devattr, "PAGE_SIZE_QCOM": qcom_hp_devattr, "CORE_TEMPERATURE_ALTERA": altera_temp_devattr, "MAX_ATOMIC_COUNTERS_EXT": ("cl_ext_atomic_counters_64", "2013.2"), "PARENT_DEVICE_EXT": fission, "PARTITION_TYPES_EXT": fission, "AFFINITY_DOMAINS_EXT": fission, "REFERENCE_COUNT_EXT": fission, "PARTITION_STYLE_EXT": fission, "LINKER_AVAILABLE": cl_12, "BUILT_IN_KERNELS": cl_12, "IMAGE_MAX_BUFFER_SIZE": cl_12, "IMAGE_MAX_ARRAY_SIZE": cl_12, "PARENT_DEVICE": cl_12, "PARTITION_MAX_SUB_DEVICES": cl_12, "PARTITION_PROPERTIES": cl_12, "PARTITION_AFFINITY_DOMAIN": cl_12, "PARTITION_TYPE": cl_12, "REFERENCE_COUNT": cl_12, "PREFERRED_INTEROP_USER_SYNC": cl_12, "PRINTF_BUFFER_SIZE": cl_12, "DEVICE_ON_HOST_PROPERTIES": cl_20, "MAX_READ_WRITE_IMAGE_ARGS": cl_20, "MAX_GLOBAL_VARIABLE_SIZE": cl_20, "QUEUE_ON_DEVICE_PROPERTIES": cl_20, "QUEUE_ON_DEVICE_PREFERRED_SIZE": cl_20, "QUEUE_ON_DEVICE_MAX_SIZE": cl_20, 
"MAX_ON_DEVICE_QUEUES": cl_20, "MAX_ON_DEVICE_EVENTS": cl_20, "SVM_CAPABILITIES": cl_20, "GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE": cl_20, "MAX_PIPE_ARGS": cl_20, "PIPE_MAX_ACTIVE_RESERVATIONS": cl_20, "PIPE_MAX_PACKET_SIZE": cl_20, "PREFERRED_PLATFORM_ATOMIC_ALIGNMENT": cl_20, "PREFERRED_GLOBAL_ATOMIC_ALIGNMENT": cl_20, "PREFERRED_LOCAL_ATOMIC_ALIGNMENT": cl_20, "IL_VERSION": cl_21, "MAX_NUM_SUB_GROUPS": cl_21, "SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS": cl_21, "NUMERIC_VERSION": cl_30, "EXTENSIONS_WITH_VERSION": cl_30, "ILS_WITH_VERSION": cl_30, "BUILT_IN_KERNELS_WITH_VERSION": cl_30, "ATOMIC_MEMORY_CAPABILITIES": cl_30, "ATOMIC_FENCE_CAPABILITIES": cl_30, "NON_UNIFORM_WORK_GROUP_SUPPORT": cl_30, "OPENCL_C_ALL_VERSIONS": cl_30, "PREFERRED_WORK_GROUP_SIZE_MULTIPLE": cl_30, "WORK_GROUP_COLLECTIVE_FUNCTIONS_SUPPORT": cl_30, "GENERIC_ADDRESS_SPACE_SUPPORT": cl_30, "OPENCL_C_FEATURES": cl_30, "DEVICE_ENQUEUE_CAPABILITIES": cl_30, "PIPE_SUPPORT": cl_30, }, cl.device_topology_type_amd: { "PCIE": amd_devattr, }, cl.mem_object_type: { "IMAGE2D_ARRAY": cl_12, "IMAGE1D": cl_12, "IMAGE1D_ARRAY": cl_12, "IMAGE1D_BUFFER": cl_12, "PIPE": cl_20, }, cl.device_type: { "CUSTOM": cl_12, }, cl.context_properties: { "GL_CONTEXT_KHR": gl_sharing, "EGL_DISPLAY_KHR": gl_sharing, "GLX_DISPLAY_KHR": gl_sharing, "WGL_HDC_KHR": gl_sharing, "CGL_SHAREGROUP_KHR": gl_sharing, "OFFLINE_DEVICES_AMD": ("cl_amd_offline_devices", "2011.1"), }, cl.device_fp_config: { "SOFT_FLOAT": cl_11, "CORRECTLY_ROUNDED_DIVIDE_SQRT": cl_12, }, cl.command_queue_properties: { "ON_DEVICE": cl_20, "ON_DEVICE_DEFAULT": cl_20, }, cl.context_info: { "NUM_DEVICES": cl_11, "INTEROP_USER_SYNC": cl_12, }, cl.channel_type: { "UNORM_INT24": ("CL_1.2", "2020.3"), "UNORM_INT_101010_2": ("CL_2.1", "2020.3"), }, cl.channel_order: { "Rx": cl_11, "RGx": cl_11, "RGBx": cl_11, "sRGB": cl_20, "sRGBx": cl_20, "sRGBA": cl_20, "sBGRA": cl_20, "ABGR": cl_20, }, cl.kernel_work_group_info: { "PREFERRED_WORK_GROUP_SIZE_MULTIPLE": cl_11, "PRIVATE_MEM_SIZE": cl_11, "GLOBAL_WORK_SIZE": cl_12, }, cl.kernel_sub_group_info: { "MAX_SUB_GROUP_SIZE_FOR_NDRANGE": cl_21_late, "SUB_GROUP_COUNT_FOR_NDRANGE": cl_21_late, "LOCAL_SIZE_FOR_SUB_GROUP_COUNT": cl_21_late, "MAX_NUM_SUB_GROUPS": cl_21_late, "COMPILE_NUM_SUB_GROUPS": cl_21_late, }, cl.addressing_mode: { "MIRRORED_REPEAT": cl_11, }, cl.sampler_info: { "MIP_FILTER_MODE": ("(deprecated)", "2015.2"), "LOD_MIN": ("(deprecated)", "2015.2"), "LOD_MAX": ("(deprecated)", "2015.2"), "MIP_FILTER_MODE_KHR": ("cl_khr_mipmap_image", "2020.3"), "LOD_MIN_KHR": ("cl_khr_mipmap_image", "2020.3"), "LOD_MAX_KHR": ("cl_khr_mipmap_image", "2020.3"), "PROPERTIES": cl_30, }, cl.event_info: { "CONTEXT": cl_11, }, cl.mem_info: { "ASSOCIATED_MEMOBJECT": cl_11, "OFFSET": cl_11, "USES_SVM_POINTER": cl_20, }, cl.image_info: { "ARRAY_SIZE": cl_12, "BUFFER": cl_12, "NUM_MIP_LEVELS": cl_12, "NUM_SAMPLES": cl_12, }, cl.pipe_info: { "PACKET_SIZE": ("CL_2.0", "2020.3"), "MAX_PACKETS": ("CL_2.0", "2020.3"), "PROPERTIES": cl_30, }, cl.pipe_properties: { "PACKET_SIZE": ("CL_2.0", "2020.3"), "MAX_PACKETS": ("CL_2.0", "2020.3"), }, cl.map_flags: { "WRITE_INVALIDATE_REGION": cl_12, }, cl.program_info: { "NUM_KERNELS": cl_12, "KERNEL_NAMES": cl_12, "PROGRAM_IL": cl_21_late, "SCOPE_GLOBAL_CTORS_PRESENT": cl_22, "SCOPE_GLOBAL_DTORS_PRESENT": cl_22, }, cl.program_build_info: { "BINARY_TYPE": cl_12, "GLOBAL_VARIABLE_TOTAL_SIZE": cl_20, }, cl.program_binary_type: { "NONE": cl_12, "COMPILED_OBJECT": cl_12, "LIBRARY": cl_12, "EXECUTABLE": cl_12, }, cl.kernel_info: { 
"ATTRIBUTES": cl_12, }, cl.kernel_arg_info: { "ADDRESS_QUALIFIER": cl_12, "ACCESS_QUALIFIER": cl_12, "TYPE_NAME": cl_12, "TYPE_QUALIFIER": cl_12_2015, "ARG_NAME": cl_12, }, cl.kernel_arg_address_qualifier: { "GLOBAL": cl_12, "LOCAL": cl_12, "CONSTANT": cl_12, "PRIVATE": cl_12, }, cl.kernel_arg_access_qualifier: { "READ_ONLY": cl_12, "WRITE_ONLY": cl_12, "READ_WRITE": cl_12, "NONE": cl_12, }, cl.kernel_arg_type_qualifier: { "NONE": cl_12_2015, "CONST": cl_12_2015, "RESTRICT": cl_12_2015, "VOLATILE": cl_12_2015, "PIPE": cl_20, }, cl.command_type: { "READ_BUFFER_RECT": cl_11, "WRITE_BUFFER_RECT": cl_11, "COPY_BUFFER_RECT": cl_11, "USER": cl_11, "BARRIER": cl_12, "MIGRATE_MEM_OBJECTS": cl_12, "FILL_BUFFER": cl_12, "FILL_IMAGE": cl_12, "SVM_FREE": cl_20, "SVM_MEMCPY": cl_20, "SVM_MEMFILL": cl_20, "SVM_MAP": cl_20, "SVM_UNMAP": cl_20, "SVM_MIGRATE_MEM": cl_30, }, cl.command_queue_info: { "SIZE": cl_20, }, cl.queue_properties: { "PROPERTIES": cl_20, "SIZE": cl_20, }, cl.mem_flags: { "USE_PERSISTENT_MEM_AMD": ("cl_amd_device_memory_flags", "2011.1"), "HOST_WRITE_ONLY": cl_12, "KERNEL_READ_AND_WRITE": cl_20, }, cl.svm_mem_flags: { "READ_WRITE": cl_20, "WRITE_ONLY": cl_20, "READ_ONLY": cl_20, "SVM_FINE_GRAIN_BUFFER": cl_20, "SVM_ATOMICS": cl_20, }, cl.device_svm_capabilities: { "COARSE_GRAIN_BUFFER": cl_20, "FINE_GRAIN_BUFFER": cl_20, "FINE_GRAIN_SYSTEM": cl_20, "ATOMICS": cl_20, }, cl.device_partition_property: { "EQUALLY": cl_12, "BY_COUNTS": cl_12, "BY_NAMES": cl_12, "BY_AFFINITY_DOMAIN": cl_12, "PROPERTIES_LIST_END": cl_12, "PARTITION_BY_COUNTS_LIST_END": cl_12, "PARTITION_BY_NAMES_LIST_END": cl_12, }, cl.device_affinity_domain: { "NUMA": cl_12, "L4_CACHE": cl_12, "L3_CACHE": cl_12, "L2_CACHE": cl_12, "L1_CACHE": cl_12, "NEXT_PARITIONNABLE": cl_12, }, cl.device_atomic_capabilities: { "ORDER_RELAXED": cl_30, "ORDER_ACQ_REL": cl_30, "ORDER_SEQ_CST": cl_30, "SCOPE_WORK_ITEM": cl_30, "SCOPE_WORK_GROUP": cl_30, "SCOPE_DEVICE": cl_30, "SCOPE_ALL_DEVICES": cl_30, }, cl.device_device_enqueue_capabilities: { "SUPPORTED": cl_30, "REPLACEABLE_DEFAULT": cl_30, }, cl.profiling_info: { "COMPLETE": cl_20, }, cl.mem_migration_flags: { "HOST": cl_12, "CONTENT_UNDEFINED": cl_12, }, cl.version_bits: { "MAJOR_BITS": cl_30, "MINOR_BITS": cl_30, "PATCH_BITS": cl_30, "MAJOR_MASK": cl_30, "MINOR_MASK": cl_30, "PATCH_MASK": cl_30, }, cl.khronos_vendor_id: { "CODEPLAY": cl_30, }, } try: gl_ci = cl.gl_context_info except AttributeError: pass else: const_ext_lookup[gl_ci] = { getattr(gl_ci, "CURRENT_DEVICE_FOR_GL_CONTEXT_KHR", None): gl_sharing, getattr(gl_ci, "DEVICES_FOR_GL_CONTEXT_KHR", None): gl_sharing, } cls_ext_lookup = { # cl.buffer_create_type: ("CL_1.1", "0.92"), } def doc_class(cls): print(".. class :: %s" % cls.__name__) print() if cls.__name__.startswith("gl_"): print(" Only available when PyOpenCL is compiled with GL support.") print(" See :func:`have_gl`.") print() if cls in cls_ext_lookup: for ln in get_extra_lines(cls_ext_lookup[cls]): print(ln) cls_const_ext = const_ext_lookup.get(cls, {}) for name in sorted(dir(cls)): if not name.startswith("_") and name not in ["to_string", "names", "values"]: print(" .. attribute :: %s" % name) if name in cls_const_ext: for ln in get_extra_lines(cls_const_ext[name]): print(" "+ln) print(" .. method :: to_string(value)") print() print(" Returns a :class:`str` representing *value*.") print() print(" .. versionadded:: 0.91") print() if not cl.have_gl(): print(".. 
warning::") print() print(" This set of PyOpenCL documentation is incomplete because it") print(" was generated on a PyOpenCL build that did not support OpenGL.") print() import inspect CONSTANT_CLASSES = [ getattr(cl, name) for name in dir(cl) if inspect.isclass(getattr(cl, name)) and name[0].islower() and name not in ["zip", "map", "range"]] print(".. This is an automatically generated file. DO NOT EDIT") print() for cls in CONSTANT_CLASSES: doc_class(cls) pyopencl-2025.1/doc/misc.rst0000644000000000000000000007317614332717401012634 0ustar00Installation ============ Installing from Conda Forge --------------------------- Installing PyOpenCL ^^^^^^^^^^^^^^^^^^^ By far the easiest way to install PyOpenCL is to use the packages available in `Conda Forge `__. Conda Forge is a repository of community-maintained packages for the `Conda `__ package manager. The following instructions are aimed at Linux and macOS. The analogous steps for Windows should also work. Install a version of `miniforge `__ that fits your system:: curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" bash ./Miniforge3-*.sh # (answer questions, pick install location) Then run:: source /WHERE/YOU/INSTALLED/MINIFORGE/bin/activate root conda install pyopencl You can install these pieces of software in your user account and do not need root/administrator privileges. .. note:: This installs a conda environment based on Conda Forge. This is not interchangeable with a conda environment based on the (more common) anaconda. If you have an existing conda environment sitting around, just following the instructions below will likely not work. Instead, the suggested approach is to create new environment from scratch, starting with miniforge, above. Enabling access to CPUs and GPUs via (Py)OpenCL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Note that PyOpenCL is no fun (i.e. cannot run code) without an OpenCL device driver (a so-called "ICD", for "installable client driver") that provides access to hardware through OpenCL. If you get an error message like ``pyopencl._cl.LogicError: clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR``, that means PyOpenCL installed successfully, but you have no OpenCL drivers installed. Note that drivers (ICDs) are separate pieces of software from PyOpenCL. They might be provided by your hardware vendor (e.g. for Nvidia or AMD GPUs). If you have such hardware, see below for instructions on how to make those work with PyOpenCL from Conda Forge. It is important to note that OpenCL is not restricted to GPUs. In fact, no special hardware is required to use OpenCL for computation--your existing CPU is enough. On Linux or macOS, type:: conda install pocl to install a CPU-based OpenCL driver. On macOS, PoCL can offer a marked robustness (and, sometimes, performance) improvement over the OpenCL drivers built into the operating system. On Linux and Windows, you can use Intel's CPU OpenCL runtime:: conda install intel-opencl-rt On Linux Intel Broadwell or newer processors with an Intel graphics card, you can use NEO:: conda install intel-compute-runtime On Linux Intel Sandybridge or newer processors with an Intel graphics card, you can use Beignet:: conda install beignet On Linux, Windows and macOS, you can use Oclgrind to detect memory access errors:: conda install oclgrind You are now ready to run code based on PyOpenCL, such as the `code examples `__. 
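As a quick sanity check (a minimal sketch that works with any of the drivers above), you can ask PyOpenCL to enumerate what it can see::

    import pyopencl as cl

    for platform in cl.get_platforms():
        print(platform.name)
        for device in platform.get_devices():
            print("   ", device.name)

If this prints at least one platform and device, your driver (ICD) setup is working.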
Using vendor-supplied OpenCL drivers (mainly on Linux) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The instructions above help you get a basic OpenCL environment going that will work independently of whether you have specialized hardware (such as GPUs or FPGAs) available. If you *do* have such hardware, read on for how to make it work. On Linux, PyOpenCL finds which drivers are installed by looking for files with the extension ``.icd`` in a directory. PyOpenCL as installed from Conda will look for these files in :file:`/WHERE/YOU/INSTALLED/MINICONDA/etc/OpenCL/vendors`. They are just simple text files containing either just the file names or the fully qualified path names of the shared library providing the OpenCL driver. .. note:: If you ran the commands above in a `Conda environment `__ (i.e. if the environment indicator on your command line prompt says anything other than ``(root)``), then you may need to use a path like the following instead: :file:`/WHERE/YOU/INSTALLED/MINICONDA/envs/ENVIRONMENTNAME/etc/OpenCL/vendors` Note that you should replace ``ENVIRONMENTNAME`` with the name of your environment, shown between parentheses on your command line prompt. This path (for the currently-active conda environment) can be obtained from the environment variable ``CONDA_PREFIX``, i.e., :file:`$CONDA_PREFIX/etc/OpenCL/vendors` (once the Conda environment is activated). On Linux, if you have other OpenCL drivers installed (such as for your GPU), those will be in :file:`/etc/OpenCL/vendors`. You can make them work with PyOpenCL from Conda Forge by using the command:: conda install ocl-icd-system will make sure these system-wide ICDs are also visible in your conda environment. As an alternative, one may manually copy ICD files from :file:`/etc/OpenCL/vendors` into, e.g., :file:`$CONDA_PREFIX/etc/OpenCL/vendors`. If you are looking for more information, see `ocl-icd `__ and its documentation. Ocl-icd is the "ICD loader" used by PyOpenCL when installed from Conda Forge on Linux. It represents the code behind :file:`libOpenCL.so`. On macOS, using the command:: conda install ocl_icd_wrapper_apple will make sure that the Apple provided CPU and GPU implementations are available. On Windows, the packaging of PyOpenCL for Conda Forge relies on the `Khronos ICD Loader `__, and it is packaged so that the OpenCL drivers that are registered in the OS using registry keys are automatically available. Installing from PyPI wheels --------------------------- PyOpenCL distributes wheels for most popular OSs and Python versions. To check available versions please visit `PyPI page for PyOpenCL `__. On Linux, the wheels come with `OCL-ICD `__ bundled and configured to use any OpenCL implementation supporting the ICD interface and listed in :file:`/etc/OpenCL/vendors`. Wheels for Windows and MacOS are built using the ICD Loader from the Khronos Group. To install, type:: pip install pyopencl You can also install the following CPU based OpenCL implementation using pip shipped as binary wheels. Note that pyopencl has to be installed using a wheel for pyopencl to recognize these wheels. To install pyopencl with PoCL, a CPU based implementation do:: pip install pyopencl[pocl] To install pyopencl with oclgrind, an OpenCL debugger do:: pip install pyopencl[oclgrind] .. note:: Avoid mixing components installed from Conda Forge and PyPI. For example, installing PyOpenCL from pip followed by OCL-ICD from Conda Forge can redirect the ICD loader, removing access to system-wide ICDs. 
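If you are unsure which implementations the ICD loader in your current environment can actually see, a shell one-liner gives a quick answer (run it with the ``python`` of the environment in question)::

    python -c "import pyopencl as cl; print([p.name for p in cl.get_platforms()])"

With the ``pyopencl[pocl]`` extra installed, for instance, you would expect a PoCL platform (named "Portable Computing Language") to appear in this list.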
Installing from source ---------------------- Installing PyOpenCL *from source* should mostly not be necessary unless you have very specific needs or would like to modify PyOpenCL yourself. You can find generic installation instructions for ``nanobind``-based packages `here `__. For PyOpenCL, the basic process is as follows: .. code-block:: bash $ cd pyopencl # non-editable install: $ pip install -v . # editable install - make sure to disable build isolation: $ pip install nanobind scikit-build-core[pyproject] numpy ninja $ pip install --no-build-isolation -ve . # editable install with automatic recompilation if needed (somewhat experimental): $ pip install --no-build-isolation -Ceditable.rebuild=true -Cbuild-dir=build -ve . PyOpenCL will attempt to automatically find and use the OpenCL headers and libraries while building. You can also specify the paths to the OpenCL headers and libraries manually: .. code-block:: bash # Option 1: specify the paths via environment variables: $ export CL_INC_DIR= $ export CL_LIB_DIR= $ export CL_LIBNAME= # Option 2: specify the paths via arguments to pip install: $ pip install -v . --config-settings='cmake.args=-DCL_INC_DIR=/path/to/OpenCL/include;-DCL_LIB_DIR=/path/to/OpenCL/lib' Tips ==== Syntax highlighting ------------------- You can obtain Vim syntax highlighting for OpenCL C inlined in Python by checking `this file `__. Note that the triple-quoted strings containing the source must start with ``"""//CL// ..."""``. .. _ipython-integration: IPython integration ------------------- PyOpenCL comes with IPython integration, which lets you seamlessly integrate PyOpenCL kernels into your IPython notebooks. Simply load the PyOpenCL IPython extension using:: %load_ext pyopencl.ipython_ext and then use the ``%%cl_kernel`` 'cell-magic' command. See `this notebook `__ (which ships with PyOpenCL) for a demonstration. You can pass build options to be used for building the program executable by using the ``-o`` flag on the first line of the cell (next to the ``%%cl_kernel`` directive). For example:: %%cl_kernel -o "-cl-fast-relaxed-math" There are also line magics: ``cl_load_edit_kernel`` which will load a file into the next cell (adding ``cl_kernel`` to the first line) and ``cl_kernel_from_file`` which will compile kernels from a file (as if you copy-and-pasted the contents of the file to a cell with ``cl_kernel``). Both of these magics take options ``-f`` to specify the file and optionally ``-o`` for build options. .. versionadded:: 2014.1 Guidelines ========== .. _api-compatibility: API Stability ------------- I consider PyOpenCL's API "stable". That doesn't mean it can't change. But if it does, your code will generally continue to run. It may however start spewing warnings about things you need to change to stay compatible with future versions. Deprecation warnings will be around for a whole year, as identified by the first number in the release name. (the "2014" in "2014.1") I.e. a function that was deprecated in 2014.n will generally be removed in 2015.n (or perhaps later). Further, the stability promise applies for any code that's part of a released version. It doesn't apply to undocumented bits of the API, and it doesn't apply to unreleased code downloaded from git. .. _versus-c: Relation with OpenCL's C Bindings --------------------------------- We've tried to follow these guidelines when binding the OpenCL's C interface to Python: * Remove the ``cl_``, ``CL_`` and ``cl`` prefix from data types, macros and function names. * Follow :pep:`8`, i.e. 
* Make function names lowercase. * If a data type or function name is composed of more than one word, separate the words with a single underscore. * ``get_info`` functions become attributes. * Object creation is done by constructors, to the extent possible. (i.e. minimize use of "factory functions") * If an operation involves two or more "complex" objects (like e.g. a kernel enqueue involves a kernel and a queue), refuse the temptation to guess which one should get a method for the operation. Instead, simply leave that command to be a function. .. _interoperability: Interoperability with other OpenCL software ------------------------------------------- Just about every object in :mod:`pyopencl` supports the following interface (here shown as an example for :class:`pyopencl.MemoryObject`, from which :class:`pyopencl.Buffer` and :class:`pyopencl.Image` inherit): * :meth:`pyopencl.MemoryObject.from_int_ptr` * :attr:`pyopencl.MemoryObject.int_ptr` This allows retrieving the C-level pointer to an OpenCL object as a Python integer, which may then be passed to other C libraries whose interfaces expose OpenCL objects. It also allows turning C-level OpenCL objects obtained from other software to be turned into the corresponding :mod:`pyopencl` objects. .. versionadded:: 2013.2 User-visible Changes ==================== Unreleased ---------- .. note:: This version is currently under development. You can get snapshots from PyOpenCL's `git repository `__. Version 2022.2 -------------- - Added :ref:`opaque-style SVM ` and :class:`pyopencl.SVMPointer`. - Added :class:`pyopencl.tools.SVMPool`. - Added automatic queue-synchronized deallocation of SVM. Version 2020.2 -------------- - Drop Python 2 support. - Add ``allow_empty_ndrange`` to kernel enqueue. - Bug fixes. Version 2018.2 -------------- * Use pybind11. * Many bug fixes. * Support arrays with offsets in scan kernels. Version 2018.1 -------------- * Introduce *eliminate_empty_output_lists* argument of :class:`pyopencl.algorithm.ListOfListsBuilder`. * Many bug fixes. Version 2017.2 -------------- * Many bug fixes. Version 2017.1 -------------- * Introduce :mod:`pyopencl.cltypes` Version 2016.2 -------------- * Deprecate RANLUXCL. It will be removed in the 2018.x series of PyOpenCL. * Introduce Random123 random number generators. See :mod:`pyopencl.clrandom` for more information. * Add support for **range** and **slice** kwargs and data-less reductions to :class:`pyopencl.reduction.ReductionKernel`. * Add support for SPIR-V. (See :class:`pyopencl.Program`.) * Add support for :ref:`svm`. * :class:`pyopencl.MemoryMap` is usable as a context manager. Version 2016.1 -------------- * The ``from_int_ptr`` methods now take a *retain* parameter for more convenient ownership management. * Kernel build options (if passed as a list) are now properly quoted. (This is a potentially compatibility-breaking change.) * Many bug fixes. (GL interop, Windows, event callbacks and more) Version 2015.2.4 ---------------- * Fix building on Windows, using mingwpy and VS 2015. Version 2015.2.3 ---------------- * Fix one more Ubuntu 14.x build issue. Version 2015.2.2 ---------------- * Fix compatibility with CL 1.1 * Fix compatibility with Ubuntu 14.x. * Various bug fixes Version 2015.2.1 ---------------- * Fix global_offset kernel launch parameter Version 2015.2 -------------- * **[INCOMPATIBLE]** Changed PyOpenCL's complex numbers from ``float2`` and ``double2`` OpenCL vector types to custom ``struct``. 
This was changed because it very easily introduced bugs where * complex*complex * real+complex *look* like they may do the right thing, but silently do the wrong thing. * Rewrite of the wrapper layer to be based on CFFI * Pypy compatibility * Faster kernel invocation through Python launcher code generation * PoCL compatibility Version 2015.1 -------------- * Support for new-style buffer protocol * Numerous fixes Version 2014.1 -------------- * :ref:`ipython-integration` * Bug fixes Version 2013.2 -------------- * Add :meth:`pyopencl.array.Array.map_to_host`. * Support *strides* on :func:`pyopencl.enqueue_map_buffer` and :func:`pyopencl.enqueue_map_image`. * :class:`pyopencl.ImageFormat` was made comparable and hashable. * :mod:`pyopencl.reduction` supports slicing (contributed by Alex Nitz) * Added :ref:`interoperability` * Bug fixes Version 2013.1 -------------- * Vastly improved :ref:`custom-scan`. * Add :func:`pyopencl.tools.match_dtype_to_c_struct`, for better integration of the CL and :mod:`numpy` type systems. * More/improved Bessel functions. See `the source `__. * Add :envvar:`PYOPENCL_NO_CACHE` environment variable to aid debugging. (e.g. with AMD's CPU implementation, see `their programming guide `__) * Deprecated :func:`pyopencl.tools.register_dtype` in favor of :func:`pyopencl.tools.get_or_register_dtype`. * Clean up the :class:`pyopencl.array.Array` constructor interface. * Deprecate ``pyopencl.array.DefaultAllocator``. * Deprecate ``pyopencl.tools.CLAllocator`` * Introduce :class:`pyopencl.tools.DeferredAllocator`, :class:`pyopencl.tools.ImmediateAllocator`. * Allow arrays whose beginning does not coincide with the beginning of their :attr:`pyopencl.array.Array.data` :class:`pyopencl.Buffer`. See :attr:`pyopencl.array.Array.base_data` and :attr:`pyopencl.array.Array.offset`. Note that not all functions in PyOpenCL support such arrays just yet. These will fail with :exc:`pyopencl.array.ArrayHasOffsetError`. * Add :meth:`pyopencl.array.Array.__getitem__` and :meth:`pyopencl.array.Array.__setitem__`, supporting generic slicing. It is *possible* to create non-contiguous arrays using this functionality. Most operations (elementwise etc.) will not work on such arrays. Note also that some operations (specifically, reductions and scans) on sliced arrays that start past the beginning of the original array will fail for now. This will be fixed in a future release. * :class:`pyopencl.CommandQueue` may be used as a context manager (in a ``with`` statement) * Add :func:`pyopencl.clmath.atan2`, :func:`pyopencl.clmath.atan2pi`. * Add :func:`pyopencl.array.concatenate`. * Add :meth:`pyopencl.Kernel.capture_call`. .. note:: The addition of :meth:`pyopencl.array.Array.__getitem__` has an unintended consequence due to `numpy bug 3375 `__. For instance, this expression:: numpy.float32(5) * some_pyopencl_array may take a very long time to execute. This is because :mod:`numpy` first builds an object array of (compute-device) scalars (!) before it decides that that's probably not such a bright idea and finally calls ``pyopencl.array.Array.__rmul__``. Note that only left arithmetic operations of :class:`pyopencl.array.Array` by :mod:`numpy` scalars are affected. Python's number types (:class:`float` etc.) are unaffected, as are right multiplications. If a program that used to run fast suddenly runs extremely slowly, it is likely that this bug is to blame. Here's what you can do: * Use Python scalars instead of :mod:`numpy` scalars. * Switch to right multiplications if possible. 
* Use a patched :mod:`numpy`. See the bug report linked above for a pull request with a fix. * Switch to a fixed version of :mod:`numpy` when available. Version 2012.1 -------------- * Support for complex numbers. * Support for Bessel functions. (experimental) * Numerous fixes. Version 2011.2 -------------- * Add :func:`pyopencl.enqueue_migrate_mem_objects`. * Add :func:`pyopencl.image_from_array`. * IMPORTANT BUGFIX: Kernel caching was broken for all the 2011.1.x releases, with severe consequences on the execution time of :class:`pyopencl.array.Array` operations. Henrik Andresen at a `PyOpenCL workshop at DTU `__ first noticed the strange timings. * All comparable PyOpenCL objects are now also hashable. * Add ``pyopencl.tools.context_dependent_memoize`` to the documented functionality. * Base :mod:`pyopencl.clrandom` on RANLUXCL (``https://bitbucket.org/ivarun/ranluxcl>``), add functionality. * Add :class:`pyopencl.NannyEvent` objects. * Add :mod:`pyopencl.characterize`. * Ensure compatibility with OS X Lion. * Add :func:`pyopencl.tools.register_dtype` to enable scan/reduction on struct types. * :func:`pyopencl.enqueue_migrate_mem_objects` was renamed ``pyopencl.enqueue_migrate_mem_objects_ext``. :func:`pyopencl.enqueue_migrate_mem_objects` now refers to the OpenCL 1.2 function of this name, if available. * :meth:`pyopencl.Device.create_sub_devices` was renamed ``pyopencl.Device.create_sub_devices_ext``. :meth:`pyopencl.Device.create_sub_devices` now refers to the OpenCL 1.2 function of this name, if available. * Alpha support for OpenCL 1.2. Version 2011.1.2 ---------------- * More bug fixes. Version 2011.1.1 ---------------- * Fixes for Python 3 compatibility. (with work by Christoph Gohlke) Version 2011.1 -------------- * All *is_blocking* parameters now default to *True* to avoid crashy-by-default behavior. (suggested by Jan Meinke) In particular, this change affects ``pyopencl.enqueue_read_buffer``, ``pyopencl.enqueue_write_buffer``, ``pyopencl.enqueue_read_buffer_rect``, ``pyopencl.enqueue_write_buffer_rect``, ``pyopencl.enqueue_read_image``, ``pyopencl.enqueue_write_image``, ``pyopencl.enqueue_map_buffer``, ``pyopencl.enqueue_map_image``. * Add :mod:`pyopencl.reduction`. * Add :ref:`reductions`. * Add :mod:`pyopencl.scan`. * Add :meth:`pyopencl.MemoryObject.get_host_array`. * Deprecate context arguments of :func:`pyopencl.array.to_device`, :func:`pyopencl.array.zeros`, :func:`pyopencl.array.arange`. * Make construction of :class:`pyopencl.array.Array` more flexible (*cqa* argument.) * Add :ref:`memory-pools`. * Add vector types, see :class:`pyopencl.array.vec`. * Add :attr:`pyopencl.array.Array.strides`, :attr:`pyopencl.array.Array.flags`. Allow the creation of arrays in C and Fortran order. * Add :func:`pyopencl.enqueue_copy`. Deprecate all other transfer functions. * Add support for numerous extensions, among them device fission. * Add a compiler cache. * Add the 'g_times_l' keyword arg to kernel execution. Version 0.92 ------------ * Add support for OpenCL 1.1. * Add support for the `cl_khr_gl_sharing `__ extension, leading to working GL interoperability. * Add :meth:`pyopencl.Kernel.set_args`. * The call signature of :meth:`pyopencl.Kernel.__call__` changed to emphasize the importance of *local_size*. * Add :meth:`pyopencl.Kernel.set_scalar_arg_dtypes`. * Add support for the `cl_nv_device_attribute_query `__ extension. * Add :meth:`pyopencl.array.Array` and related functionality. * Make build not depend on Boost C++. 
Version 0.91.5 -------------- * Add :attr:`pyopencl.ImageFormat.channel_count`, :attr:`pyopencl.ImageFormat.dtype_size`, :attr:`pyopencl.ImageFormat.itemsize`. * Add missing ``pyopencl.enqueue_copy_buffer``. * Add :func:`pyopencl.create_some_context`. * Add :func:`pyopencl.enqueue_barrier`, which was previously missing. Version 0.91.4 -------------- A bugfix release. No user-visible changes. Version 0.91.3 -------------- * All parameters named *host_buffer* were renamed *hostbuf* for consistency with the :class:`pyopencl.Buffer` constructor introduced in 0.91. Compatibility code is in place. * The :class:`pyopencl.Image` constructor does not need a *shape* parameter if the given *hostbuf* has *hostbuf.shape*. * The :class:`pyopencl.Context` constructor can now be called without parameters. Version 0.91.2 -------------- * :meth:`pyopencl.Program.build` now captures build logs and adds them to the exception text. * Deprecate ``pyopencl.create_context_from_type`` in favor of second form of :class:`pyopencl.Context` constructor * Introduce :class:`pyopencl.LocalMemory`. * Document kernel invocation and :meth:`pyopencl.Kernel.set_arg`. Version 0.91.1 -------------- * Fixed a number of bugs, notably involving :class:`pyopencl.Sampler`. * :class:`pyopencl.Device`, :class:`pyopencl.Platform`, :class:`pyopencl.Context` now have nicer string representations. * Add :attr:`pyopencl.Image.shape`. (suggested by David Garcia) Version 0.91 ------------ * Add :ref:`gl-interop`. * Add a test suite. * Fix numerous ``get_info`` bugs. (reports by David Garcia and the test suite) * Add :meth:`pyopencl.ImageFormat.__repr__`. * Add :meth:`pyopencl.addressing_mode.to_string` and colleagues. * The ``pitch`` arguments to ``pyopencl.create_image_2d``, ``pyopencl.create_image_3d``, ``pyopencl.enqueue_read_image``, and ``pyopencl.enqueue_write_image`` are now defaulted to zero. The argument order of ``enqueue_{read,write}_image`` has changed for this reason. * Deprecate ``pyopencl.create_image_2d``, ``pyopencl.create_image_3d`` in favor of the :class:`pyopencl.Image` constructor. * Deprecate ``pyopencl.create_program_with_source``, ``pyopencl.create_program_with_binary`` in favor of the :class:`pyopencl.Program` constructor. * Deprecate ``pyopencl.create_buffer``, ``pyopencl.create_host_buffer`` in favor of the :class:`pyopencl.Buffer` constructor. * :meth:`pyopencl.Image.get_image_info` now actually exists. * Add :attr:`pyopencl.Image.info`. * Fix API tracing. * Add constructor arguments to :class:`pyopencl.ImageFormat`. (suggested by David Garcia) Version 0.90.4 -------------- * Add build fixes for Windows and OS X. Version 0.90.3 -------------- * Fix a GNU-ism in the C++ code of the wrapper. Version 0.90.2 -------------- * Fix :meth:`pyopencl.Platform.get_info`. * Fix passing properties to :class:`pyopencl.CommandQueue`. Also fix related documentation. Version 0.90.1 -------------- * Fix building on the Mac. Version 0.90 ------------ * Initial release. .. _license: License ======= .. include:: ../LICENSE Frequently Asked Questions ========================== The FAQ is maintained collaboratively on the `Wiki FAQ page `__. Citing PyOpenCL =============== We are not asking you to gratuitously cite PyOpenCL in work that is otherwise unrelated to software. 
That said, if you do discuss some of the development aspects of your code and would like to highlight a few of the ideas behind PyOpenCL, feel free to cite `this article `__: Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih, PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation, Parallel Computing, Volume 38, Issue 3, March 2012, Pages 157-174. Here's a BibTeX entry for your convenience:: @article{kloeckner_pycuda_2012, author = {{Kl{\"o}ckner}, Andreas and {Pinto}, Nicolas and {Lee}, Yunsup and {Catanzaro}, B. and {Ivanov}, Paul and {Fasih}, Ahmed }, title = "{PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation}", journal = "Parallel Computing", volume = "38", number = "3", pages = "157--174", year = "2012", issn = "0167-8191", doi = "10.1016/j.parco.2011.09.001", } Acknowledgments =============== Contributors ------------ Too many to list. Please see the `commit log `__ for detailed acknowledgments. Funding ------- Work on PyOpenCL was supported in part by * the US National Science Foundation under grant numbers DMS-1418961, DMS-1654756, SHF-1911019, and OAC-1931577, and * the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0003963. AK also gratefully acknowledges a hardware gift from Nvidia Corporation. The views and opinions expressed herein do not necessarily reflect those of the funding agencies. Documentation Cross-References ============================== Numpy ----- .. currentmodule:: numpy .. class:: int8 See :class:`numpy.generic`. .. class:: int32 See :class:`numpy.generic`. .. class:: float64 See :class:`numpy.generic`. OpenCL Specification -------------------- .. c:type:: cl_platform_id See the `CL specification `__. .. c:type:: cl_device_id See the `CL specification `__. .. c:type:: cl_context See the `CL specification `__. .. c:type:: cl_command_queue See the `CL specification `__. .. c:type:: cl_mem See the `CL specification `__. .. c:type:: cl_program See the `CL specification `__. .. c:type:: cl_kernel See the `CL specification `__. .. c:type:: cl_sampler See the `CL specification `__. .. c:type:: cl_event See the `CL specification `__. .. c:function:: void clCreateCommandQueueWithProperties() See the `CL specification `__. .. c:function:: void clCreateSamplerWithProperties() See the `CL specification `__. .. c:function:: void clCreatePipe() See the `CL specification `__. Internal Types -------------- .. currentmodule:: pyopencl._cl .. class:: Platform See :class:`pyopencl.Platform`. .. class:: Device See :class:`pyopencl.Device`. .. class:: CommandQueue See :class:`pyopencl.CommandQueue`. .. class:: Context See :class:`pyopencl.Context`. .. class:: Event See :class:`pyopencl.Event`. .. class:: SVMAllocation See :class:`pyopencl.SVMAllocation`. .. class:: MemoryMap See :class:`pyopencl.MemoryMap`. .. class:: Sampler See :class:`pyopencl.Sampler`. .. class:: Program See :class:`pyopencl.Program`. .. class:: _Program See :class:`pyopencl.Program`. .. class:: Kernel See :class:`pyopencl.Kernel`. pyopencl-2025.1/doc/runtime.rst0000644000000000000000000000171014332717401013345 0ustar00.. _reference-doc: .. include:: subst.rst OpenCL Runtime: Basics ====================== Version Queries --------------- .. module:: pyopencl .. moduleauthor:: Andreas Kloeckner .. data:: VERSION Gives the numeric version of PyOpenCL as a variable-length tuple of integers. Enables easy version checks such as ``VERSION >= (0, 93)``. ..
data:: VERSION_STATUS A text string such as ``"rc4"`` or ``"beta"`` qualifying the status of the release. .. data:: VERSION_TEXT The full release name (such as ``"0.93rc4"``) in string form. .. function:: get_cl_header_version() Return a variable-length tuple of integers representing the version of the OpenCL header against which PyOpenCL was compiled. .. versionadded:: 0.92 .. _errors: Error Reporting --------------- .. class:: Error Base class for all PyOpenCL exceptions. .. class:: MemoryError .. class:: LogicError .. class:: RuntimeError pyopencl-2025.1/doc/runtime_const.rst0000644000000000000000000000107414332717401014556 0ustar00OpenCL Runtime: Constants ========================= .. currentmodule:: pyopencl .. include:: constants.inc .. class:: NameVersion Describes the version of a specific feature. .. note:: Only available with OpenCL 3.0 or newer. .. versionadded:: 2020.3 .. method:: __init__(version, name) .. attribute:: version .. attribute:: name .. class:: DeviceTopologyAmd .. method:: __init__(bus, device, function) .. attribute:: type .. attribute:: bus .. attribute:: device .. attribute:: function .. vim: shiftwidth=4 pyopencl-2025.1/doc/runtime_gl.rst0000644000000000000000000000423614332717401014035 0ustar00.. include:: subst.rst .. _gl-interop: OpenCL Runtime: OpenGL Interoperability ======================================= .. currentmodule:: pyopencl Functionality in this section is only available when PyOpenCL is compiled with GL support. See :func:`have_gl`. .. versionadded:: 0.91 .. function:: have_gl() Return *True* if PyOpenCL was compiled with OpenGL interoperability, otherwise *False*. .. function:: get_gl_sharing_context_properties() Return a :class:`list` of :class:`context_properties` that will allow a newly created context to share the currently active GL context. .. function:: get_apple_cgl_share_group() Get share group handle for current CGL context. Apple OS X only. .. versionadded:: 2011.1 .. class:: GLBuffer(context, flags, bufobj) :class:`GLBuffer` inherits from :class:`MemoryObject`. .. attribute:: gl_object .. class:: GLRenderBuffer(context, flags, bufobj) :class:`GLRenderBuffer` inherits from :class:`MemoryObject`. .. attribute:: gl_object .. class:: GLTexture(context, flags, texture_target, miplevel, texture, dims) :class:`GLTexture` inherits from :class:`Image`. Only available in OpenCL 1.2 and newer. .. attribute:: gl_object .. method:: get_gl_texture_info(param) See ``gl_texture_info`` for values of *param*. Only available when PyOpenCL is compiled with GL support. See :func:`have_gl`. .. function:: enqueue_acquire_gl_objects(queue, mem_objects, wait_for=None) *mem_objects* is a list of :class:`MemoryObject` instances. |std-enqueue-blurb| .. function:: enqueue_release_gl_objects(queue, mem_objects, wait_for=None) *mem_objects* is a list of :class:`MemoryObject` instances. |std-enqueue-blurb| .. function:: get_gl_context_info_khr(properties, param_name, platform=None) Get information on which CL device corresponds to a given GL/EGL/WGL/CGL device. See the :class:`Context` constructor for the meaning of *properties* and :class:`gl_context_info` for *param_name*. .. versionchanged:: 2011.2 Accepts the *platform* argument. Using *platform* equal to None is deprecated as of PyOpenCL 2011.2. pyopencl-2025.1/doc/runtime_memory.rst0000644000000000000000000003714014332717401014743 0ustar00.. include:: subst.rst OpenCL Runtime: Memory ====================== .. currentmodule:: pyopencl .. class:: MemoryObject .. 
attribute:: info Lower case versions of the :class:`mem_info` constants may be used as attributes on instances of this class to directly query info attributes. .. attribute:: hostbuf .. method:: get_info(param) See :class:`mem_info` for values of *param*. .. method:: release() .. method:: get_host_array(shape, dtype, order="C") Return the memory object's associated host memory area as a :class:`numpy.ndarray` of the given *shape*, *dtype* and *order*. .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| Memory Migration ---------------- .. function:: enqueue_migrate_mem_objects(queue, mem_objects, flags=0, wait_for=None) :param flags: from :class:`mem_migration_flags` .. versionadded:: 2011.2 Only available with CL 1.2. Buffer ------ .. class:: Buffer(context, flags, size=0, hostbuf=None) Create a :class:`Buffer`. See :class:`mem_flags` for values of *flags*. If *hostbuf* is specified, *size* defaults to the size of the specified buffer if it is passed as zero. :class:`Buffer` inherits from :class:`MemoryObject`. .. note:: Python also defines a type of `buffer object `__, and PyOpenCL interacts with those, too, as the host-side target of :func:`enqueue_copy`. Make sure to always be clear on whether a :class:`Buffer` or a Python buffer object is needed. Note that actual memory allocation in OpenCL may be deferred. Buffers are attached to a :class:`Context` and are only moved to a device once the buffer is used on that device. That is also the point when out-of-memory errors will occur. If you'd like to be sure that there's enough memory for your allocation, either use :func:`enqueue_migrate_mem_objects` (if available) or simply perform a small transfer to the buffer. See also :class:`pyopencl.tools.ImmediateAllocator`. .. method:: get_sub_region(origin, size, flags=0) Only available in OpenCL 1.1 and newer. .. method:: __getitem__(slc) *slc* is a :class:`slice` object indicating from which byte index range a sub-buffer is to be created. The *flags* argument of :meth:`get_sub_region` is set to the same flags with which *self* was created. .. function:: enqueue_fill_buffer(queue, mem, pattern, offset, size, wait_for=None) :arg mem: the on device :class:`Buffer` :arg pattern: a buffer object (likely a :class:`numpy.ndarray`, eg. ``np.uint32(0)``). The memory associated with *pattern* can be reused or freed once the function completes. :arg size: The size in bytes of the region to be filled. Must be a multiple of the size of the pattern. :arg offset: The location in bytes of the region being filled in *mem*. Must be a multiple of the size of the pattern. Fills a buffer with the provided pattern |std-enqueue-blurb| Only available with CL 1.2. .. versionadded:: 2011.2 .. _svm: Shared Virtual Memory (SVM) --------------------------- Shared virtual memory allows the host and the compute device to share address space, so that pointers on the host and on the device may have the same meaning. In addition, it allows the same memory to be accessed by both the host and the device. *Coarse-grain* SVM requires that buffers be mapped before being accessed on the host, *fine-grain* SVM does away with that requirement. .. warning:: Compared to :class:`Buffer`\ s, SVM brings with it a new concern: the synchronization of memory deallocation. Unlike other objects in OpenCL, SVM is represented by a plain (C-language) pointer and thus has no ability for reference counting. 
As a result, it is perfectly legal to allocate a :class:`Buffer`, enqueue an operation on it, and release the buffer, without worrying about whether the operation has completed. The OpenCL implementation will keep the buffer alive until the operation has completed. This is *not* the case with SVM: Unless otherwise specified, memory deallocation is performed immediately when requested, and so SVM will be deallocated whenever the Python garbage collector sees fit, even if the operation has not completed, immediately leading to undefined behavior (i.e., typically, memory corruption and, before too long, a crash). Version 2022.2 of PyOpenCL offers substantially improved tools for dealing with this. In particular, all means for allocating SVM allow specifying a :class:`CommandQueue`, so that deallocation is enqueued and performed after previously-enqueued operations have completed. SVM requires OpenCL 2.0. .. _opaque-svm: Opaque and "Wrapped-:mod:`numpy`" Styles of Referencing SVM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When trying to pass SVM pointers to functionality in :mod:`pyopencl`, two styles are supported: - First, the opaque style. This style most closely resembles :class:`Buffer`-based allocation available in OpenCL 1.x. SVM pointers are held in opaque "handle" objects such as :class:`SVMAllocation`. - Second, the wrapped-:mod:`numpy` style. In this case, a :class:`numpy.ndarray` (or another object implementing the :c:func:`Python buffer protocol `) serves as the reference to an area of SVM. This style permits using memory areas with :mod:`pyopencl`'s SVM interfaces even if they were allocated outside of :mod:`pyopencl`. Since passing a :class:`numpy.ndarray` (or another type of object obeying the buffer interface) already has existing semantics in most settings in :mod:`pyopencl` (such as when passing arguments to a kernel or calling :func:`enqueue_copy`), there exists a wrapper object, :class:`SVM`, that may be "wrapped around" these objects to mark them as SVM. The commonality between the two styles is that both ultimately implement the :class:`SVMPointer` interface, which :mod:`pyopencl` uses to obtain the actual SVM pointer. Note that it is easily possible to obtain a :class:`numpy.ndarray` view of SVM areas held in the opaque style, see :attr:`SVMPointer.buf`, permitting transitions from opaque to wrapped-:mod:`numpy` style. The opposite transition (from wrapped-:mod:`numpy` to opaque) is not necessarily straightforward, as it would require "fishing" the opaque SVM handle out of a chain of :attr:`numpy.ndarray.base` attributes (or similar, depending on the actual object serving as the main SVM reference). See :ref:`numpy-svm-helpers` for helper functions that ease setting up the wrapped-:mod:`numpy` structure. Wrapped-:mod:`numpy` SVM tends to be a good fit for fine-grain SVM because of the ease of direct host-side access, but the creation of the nested structure that makes this possible is associated with a certain amount of cost. By comparison, opaque SVM access tends to be a good fit for coarse-grain SVM, because direct host access is not possible without mapping the array anyway, and it has lower setup cost. It is of course entirely possible to use opaque SVM access with fine-grain SVM. .. versionchanged:: 2022.2 This version adds the opaque style of SVM access. 
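To make the distinction between the two styles concrete, here is a minimal sketch. This is an illustration rather than normative API documentation: the size, the names, and the choice of coarse-grain SVM are placeholder choices, and the exact :class:`SVMAllocation` constructor arguments should be checked against the class documentation below. ::

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Opaque style: the allocation is held by a handle object. Passing
    # *queue* ties deallocation to command-queue ordering, cf. the
    # warning above.
    alloc = cl.SVMAllocation(
        ctx, 1024, 0, cl.svm_mem_flags.READ_WRITE, queue=queue)

    # Wrapped-numpy style: a numpy.ndarray backed by (here: coarse-grain)
    # SVM, marked as SVM by wrapping it in cl.SVM.
    ary = cl.svm_empty(ctx, cl.svm_mem_flags.READ_WRITE, 1024, np.float32)
    svm_ary = cl.SVM(ary)

    # Coarse-grain SVM must be mapped before host-side access:
    with svm_ary.map_rw(queue) as host_view:
        host_view.fill(0)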
Using SVM with Arrays ^^^^^^^^^^^^^^^^^^^^^ While all types of SVM can be used as the memory backing :class:`pyopencl.array.Array` objects, ensuring that new arrays returned by array operations (e.g. arithmetic) also use SVM is easiest to accomplish by passing an :class:`~pyopencl.tools.SVMAllocator` (or :class:`~pyopencl.tools.SVMPool`) as the *allocator* parameter in functions returning new arrays. SVM Pointers, Allocations, and Maps ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. autoclass:: SVMPointer .. autoclass:: SVMAllocation .. autoclass:: SVM .. autoclass:: SVMMap .. _numpy-svm-helpers: Helper functions for :mod:`numpy`-based SVM allocation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. autofunction:: svm_empty .. autofunction:: svm_empty_like .. autofunction:: csvm_empty .. autofunction:: csvm_empty_like .. autofunction:: fsvm_empty .. autofunction:: fsvm_empty_like Operations on SVM ^^^^^^^^^^^^^^^^^ (See also :ref:`mem-transfer`.) .. autofunction:: enqueue_svm_memfill .. autofunction:: enqueue_svm_migratemem Image ----- .. class:: ImageFormat(channel_order, channel_type) .. attribute:: channel_order See :class:`channel_order` for possible values. .. attribute:: channel_data_type See :class:`channel_type` for possible values. .. attribute:: channel_count .. versionadded:: 0.91.5 .. attribute:: dtype_size .. versionadded:: 0.91.5 .. attribute:: itemsize .. versionadded:: 0.91.5 .. method:: __repr__ Returns a :class:`str` representation of the image format. .. versionadded:: 0.91 |comparable| .. versionchanged:: 0.91 Constructor arguments added. .. versionchanged:: 2013.2 :class:`ImageFormat` was made comparable and hashable .. function:: get_supported_image_formats(context, flags, image_type) See :class:`mem_flags` for possible values of *flags* and :class:`mem_object_type` for possible values of *image_type*. .. class:: Image Use :func:`create_image` to create images. .. versionadded:: 0.91 .. attribute:: info Lower case versions of the :class:`mem_info` and :class:`image_info` constants may be used as attributes on instances of this class to directly query info attributes. .. attribute:: shape Return the value of the *shape* constructor argument as a :class:`tuple`. .. method:: get_image_info(param) See :class:`image_info` for values of *param*. .. method:: release() |comparable| .. autofunction:: create_image .. function:: image_from_array(ctx, ary, num_channels=None, mode="r", norm_int=False) Build a 2D or 3D :class:`Image` from the :class:`numpy.ndarray` *ary*. If *num_channels* is greater than one, the last dimension of *ary* must be identical to *num_channels*. *ary* must be in C order. If *num_channels* is not given, it defaults to 1 for scalar types and the number of entries for :ref:`vector-types`. The :class:`ImageFormat` is chosen as the first *num_channels* components of "RGBA". :param mode: "r" or "w" for read/write .. note:: When reading from the image object, the indices passed to ``read_imagef`` are in the reverse order from what they would be when accessing *ary* from Python. If *norm_int* is *True*, then the integer values are normalized to a floating point scale of 0..1 when read. .. versionadded:: 2011.2 .. function:: enqueue_fill_image(queue, mem, color, origin, region, wait_for=None) :arg color: a buffer object (likely a :class:`numpy.ndarray`) |std-enqueue-blurb| Only available with CL 1.2. .. versionadded:: 2011.2 .. _mem-transfer: Transfers --------- .. autofunction:: enqueue_copy(queue, dest, src, **kwargs) .. 
autofunction:: enqueue_fill(queue, dest, src, **kwargs) .. function:: enqueue_copy_buffer_p2p_amd(platform, queue, src, dest, size=None, wait_for=None) AMD extension to perform a peer-to-peer copy between two buffers on two different devices. The two devices must be in different contexts. The queue must be where the source buffer is located. :arg platform: a :class:`Platform` instance :arg queue: a :class:`CommandQueue` instance :arg src: a :class:`Buffer` instance :arg dest: a :class:`Buffer` instance :param size: the number of bytes to copy. If *None*, the minimum of the sizes of the two buffers is used. |std-enqueue-blurb| Only available on AMD platforms. .. versionadded:: 2023.1.2 Mapping Memory into Host Address Space -------------------------------------- .. autoclass:: MemoryMap .. function:: enqueue_map_buffer(queue, buf, flags, offset, shape, dtype, order="C", strides=None, wait_for=None, is_blocking=True) |explain-waitfor| *shape*, *dtype*, and *order* have the same meaning as in :func:`numpy.empty`. See :class:`map_flags` for possible values of *flags*. *strides*, if given, overrides *order*. :return: a tuple *(array, event)*. *array* is a :class:`numpy.ndarray` representing the host side of the map. Its *.base* member contains a :class:`MemoryMap`. .. versionchanged:: 2011.1 *is_blocking* now defaults to True. .. versionchanged:: 2013.1 *order* now defaults to "C". .. versionchanged:: 2013.2 Added *strides* argument. Sample usage:: mapped_buf = cl.enqueue_map_buffer(queue, buf, ...) with mapped_buf.base: # work with mapped_buf ... # memory will be unmapped here .. function:: enqueue_map_image(queue, buf, flags, origin, region, shape, dtype, order="C", strides=None, wait_for=None, is_blocking=True) |explain-waitfor| *shape*, *dtype*, and *order* have the same meaning as in :func:`numpy.empty`. See :class:`map_flags` for possible values of *flags*. *strides*, if given, overrides *order*. :return: a tuple *(array, event)*. *array* is a :class:`numpy.ndarray` representing the host side of the map. Its *.base* member contains a :class:`MemoryMap`. .. versionchanged:: 2011.1 *is_blocking* now defaults to True. .. versionchanged:: 2013.1 *order* now defaults to "C". .. versionchanged:: 2013.2 Added *strides* argument. Samplers -------- .. class:: Sampler .. method:: __init__(context, normalized_coords, addressing_mode, filter_mode) *normalized_coords* is a :class:`bool` indicating whether to use coordinates between 0 and 1 (*True*) or the texture's natural pixel size (*False*). See :class:`addressing_mode` and :class:`filter_mode` for possible argument values. Also supports an alternate signature ``(context, properties)``. :arg properties: a sequence of keys and values from :class:`sampler_properties` as accepted by :c:func:`clCreateSamplerWithProperties` (see the OpenCL spec for details). The trailing *0* is added automatically and does not need to be included. This signature Requires OpenCL 2 or newer. .. versionchanged:: 2018.2 The properties-based signature was added. .. attribute:: info Lower case versions of the :class:`sampler_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`sampler_info` for values of *param*. .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| Pipes ----- .. class:: Pipe(context, flags, packet_size, max_packets, properties=()) See :class:`mem_flags` for values of *flags*. 
:arg properties: a sequence of keys and values from :class:`pipe_properties` as accepted by :c:func:`clCreatePipe`. The trailing *0* is added automatically and does not need to be included. (This argument must currently be empty.) This function requires OpenCL 2 or newer. .. versionadded:: 2020.3 .. versionchanged:: 2021.1.7 *properties* now defaults to an empty tuple. .. method:: get_pipe_info(param) See :class:`pipe_info` for values of *param*. Type aliases ------------ .. currentmodule:: pyopencl._cl .. class:: Buffer See :class:`pyopencl.Buffer`. .. class:: Image See :class:`pyopencl.Image`. pyopencl-2025.1/doc/runtime_platform.rst0000644000000000000000000001322014332717401015250 0ustar00.. include:: subst.rst OpenCL Runtime: Platforms, Devices and Contexts =============================================== .. currentmodule:: pyopencl Platform -------- .. function:: get_platforms() Return a list of :class:`Platform` instances. .. class:: Platform .. attribute:: info Lower case versions of the :class:`platform_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`platform_info` for values of *param*. .. method:: get_devices(device_type=device_type.ALL) Return a list of devices matching *device_type*. See :class:`device_type` for values of *device_type*. .. versionchanged:: 2013.2 This used to raise an exception if no matching devices were found. Now, it will simply return an empty list. .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| Device ------ .. class:: Device Two instances of this class may be compared using *=="* and *"!="*. .. attribute:: info Lower case versions of the :class:`device_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`device_info` for values of *param*. .. automethod:: from_int_ptr .. autoattribute:: int_ptr .. attribute :: hashable_model_and_version_identifier An unspecified data type that can be used to (as precisely as possible, given identifying information available in OpenCL) identify a given model and software stack version of a compute device. Note that this identifier does not differentiate between different instances of the same device installed in a single host. The returned data type is hashable. .. versionadded:: 2020.1 .. method:: create_sub_devices(properties) *properties* is an array of one (or more) of the forms:: [ dpp.EQUALLY, 8] [ dpp.BY_COUNTS, 5, 7, 9, dpp.PARTITION_BY_COUNTS_LIST_END] [ dpp.BY_NAMES, 5, 7, 9, dpp.PARTITION_BY_NAMES_LIST_END] [ dpp.BY_AFFINITY_DOMAIN, dad.L1_CACHE] where ``dpp`` represents :class:`device_partition_property` and ``dad`` represent :class:`device_affinity_domain`. ``PROPERTIES_LIST_END_EXT`` is added automatically. Only available with CL 1.2. .. versionadded:: 2011.2 .. method:: device_and_host_timer :returns: a tuple ``(device_timestamp, host_timestamp)``. Only available with CL 2.0. .. versionadded:: 2020.3 .. method:: host_timer Only available with CL 2.0. .. versionadded:: 2020.3 .. autofunction:: choose_devices Context ------- .. class:: Context(devices=None, properties=None, dev_type=None) Create a new context. *properties* is a list of key-value tuples, where each key must be one of :class:`context_properties`. At most one of *devices* and *dev_type* may be not *None*, where *devices* is a list of :class:`Device` instances, and *dev_type* is one of the :class:`device_type` constants. 
If neither is specified, a context with a *dev_type* of :attr:`device_type.DEFAULT` is created. .. note:: Calling the constructor with no arguments may fail for CL drivers that support the OpenCL ICD (which applies to most modern systems). If you want similar, just-give-me-a-context-already behavior, we recommend :func:`create_some_context`. See e.g. this `explanation by AMD `__: **What has changed?** In previous beta releases functions such as clGetDeviceIDs() and clCreateContext() accepted a NULL value for the platform parameter. This release no longer allows this - the platform must be a valid one obtained by using the platform API. .. note:: Because of how OpenCL changed in order to support Installable Client Drivers (ICDs) in OpenCL 1.1, the following will *look* reasonable but often actually not work:: import pyopencl as cl ctx = cl.Context(dev_type=cl.device_type.ALL) Instead, make sure to choose a platform when choosing a device by type:: import pyopencl as cl platforms = cl.get_platforms() ctx = cl.Context( dev_type=cl.device_type.ALL, properties=[(cl.context_properties.PLATFORM, platforms[0])]) .. note:: For ``context_properties.CL_GL_CONTEXT_KHR``, ``context_properties.CL_EGL_DISPLAY_KHR``, ``context_properties.CL_GLX_DISPLAY_KHR``, ``context_properties.CL_WGL_HDC_KHR``, and ``context_properties.CL_CGL_SHAREGROUP_KHR`` ``context_properties.CL_CGL_SHAREGROUP_APPLE`` the value in the key-value pair is a PyOpenGL context or display instance. .. versionchanged:: 0.91.2 Constructor arguments *dev_type* added. .. attribute:: info Lower case versions of the :class:`context_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`context_info` for values of *param*. .. automethod:: from_int_ptr .. autoattribute:: int_ptr .. method:: set_default_device_command_queue(dev, queue) |comparable| .. autofunction:: create_some_context pyopencl-2025.1/doc/runtime_program.rst0000644000000000000000000003131614332717401015101 0ustar00.. include:: subst.rst OpenCL Runtime: Programs and Kernels ==================================== .. currentmodule:: pyopencl Program ------- .. envvar:: PYOPENCL_NO_CACHE By default, PyOpenCL will use cached (on disk) "binaries" returned by the OpenCL runtime when calling :meth:`Program.build` on a program constructed with source. (It will depend on the ICD in use how much compilation work is saved by this.) By setting the environment variable :envvar:`PYOPENCL_NO_CACHE` to any string that :func:`pytools.strtobool` evaluates as ``True``, this caching is suppressed. No additional in-memory caching is performed. To retain the compiled version of a kernel in memory, simply retain the :class:`Program` and/or :class:`Kernel` objects. PyOpenCL will also cache "invokers", which are short snippets of Python that are generated to accelerate passing arguments to and enqueuing a kernel. .. versionadded:: 2013.1 .. envvar:: PYOPENCL_COMPILER_OUTPUT When setting the environment variable :envvar:`PYOPENCL_COMPILER_OUTPUT` to any string that :func:`pytools.strtobool` evaluates as ``True``, PyOpenCL will show compiler messages emitted during program build. .. envvar:: PYOPENCL_BUILD_OPTIONS Any options found in the environment variable :envvar:`PYOPENCL_BUILD_OPTIONS` will be appended to *options* in :meth:`Program.build`. .. versionadded:: 2013.1 .. class:: Program(context, src) Program(context, devices, binaries) *binaries* must contain one binary for each entry in *devices*. 
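As an illustration, here is a minimal sketch of a round trip through both constructor forms: build a program from source, then re-create it from the binaries the runtime returned. The kernel source here is a placeholder. ::

    import pyopencl as cl

    ctx = cl.create_some_context()
    src = """
    __kernel void twice(__global float *a)
    { a[get_global_id(0)] *= 2; }
    """

    # First form: from OpenCL C source.
    prg = cl.Program(ctx, src).build()

    # Second form: from binaries, one per device in the context.
    binaries = prg.get_info(cl.program_info.BINARIES)
    prg_from_bin = cl.Program(ctx, ctx.devices, binaries).build()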
If *src* is a :class:`bytes` object starting with a valid `SPIR-V `__ magic number, it will be handed off to the OpenCL implementation as such, rather than as OpenCL C source code. (SPIR-V support requires OpenCL 2.1.) .. versionchanged:: 2016.2 Add support for SPIR-V. .. attribute:: info Lower case versions of the :class:`program_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`program_info` for values of *param*. .. method:: get_build_info(device, param) See :class:`program_build_info` for values of *param*. .. method:: build(options=[], devices=None, cache_dir=None) *options* is a string of compiler flags. Returns *self*. If *cache_dir* is not None - built binaries are cached in an on-disk cache with given path. If passed *cache_dir* is None, but context of this program was created with not-None cache_dir - it will be used as cache directory. If passed *cache_dir* is None and context was created with None cache_dir: built binaries will be cached in an on-disk cache called :file:`pyopencl-compiler-cache-vN-uidNAME-pyVERSION` in the directory returned by :func:`tempfile.gettempdir`. See also :envvar:`PYOPENCL_NO_CACHE`, :envvar:`PYOPENCL_BUILD_OPTIONS`. .. versionchanged:: 2011.1 *options* may now also be a :class:`list` of :class:`str`. .. method:: compile(self, options=[], devices=None, headers=[]) :param headers: a list of tuples *(name, program)*. Only available with CL 1.2. .. versionadded:: 2011.2 .. attribute:: kernel_name You may use ``program.kernel_name`` to obtain a :class:`Kernel` object from a program. Note that every lookup of this type produces a new kernel object, so that this **won't** work:: prg.sum.set_args(a_g, b_g, res_g) ev = cl.enqueue_nd_range_kernel(queue, prg.sum, a_np.shape, None) Instead, either use the (recommended, stateless) calling interface:: sum_knl = prg.sum sum_knl(queue, a_np.shape, None, a_g, b_g, res_g) or the long, stateful way around, if you prefer:: sum_knl.set_args(a_g, b_g, res_g) ev = cl.enqueue_nd_range_kernel(queue, sum_knl, a_np.shape, None) The following will also work, however note that a number of caches that are important for efficient kernel enqueue are attached to the :class:`Kernel` object, and these caches will be ineffective in this usage pattern:: prg.sum(queue, a_np.shape, None, a_g, b_g, res_g) Note that the :class:`Program` has to be built (see :meth:`build`) in order for this to work simply by attribute lookup. .. note:: The :class:`program_info` attributes live in the same name space and take precedence over :class:`Kernel` names. .. note:: If you need to retrieve a kernel whose name includes non-identifier characters, retrieving it as an attribute of :class:`~pyopencl.Program` will not work, for obvious reasons. In that case, you can use the :class:`~pyopencl.Kernel` constructor directly. .. method:: all_kernels() Returns a list of all :class:`Kernel` objects in the :class:`Program`. .. method:: set_specialization_constant(spec_id, buffer) Only available with CL 2.2 and newer. .. versionadded:: 2020.3 .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| .. function:: create_program_with_built_in_kernels(context, devices, kernel_names) Only available with CL 1.2. .. versionadded:: 2011.2 .. function:: link_program(context, programs, options=[], devices=None) Only available with CL 1.2. .. versionadded:: 2011.2 .. function:: unload_platform_compiler(platform) Only available with CL 1.2. .. 
versionadded:: 2011.2 Kernel ------ .. class:: Kernel(program, name) .. attribute:: info Lower case versions of the :class:`kernel_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: clone() Only available with CL 2.1. .. versionadded:: 2020.3 .. method:: get_info(param) See :class:`kernel_info` for values of *param*. .. method:: get_work_group_info(param, device) See :class:`kernel_work_group_info` for values of *param*. .. method:: get_arg_info(arg_index, param) See :class:`kernel_arg_info` for values of *param*. Only available in OpenCL 1.2 and newer. .. method:: get_sub_group_info(self, device, param, input_value=None) When the OpenCL spec requests *input_value* to be of type ``size_t``, these may be passed directly as a number. When it requests *input_value* to be of type ``size_t *``, a tuple of integers may be passed. Only available in OpenCL 2.1 and newer. .. versionadded:: 2020.3 .. method:: set_arg(self, index, arg) *arg* may be * *None*: This may be passed for ``__global`` memory references to pass a *NULL* pointer to the kernel. * Anything that satisfies the Python buffer interface, in particular :class:`numpy.ndarray`, :class:`str`, or :mod:`numpy`'s sized scalars, such as :class:`numpy.int32` or :class:`numpy.float64`. .. note:: Note that Python's own :class:`int` or :class:`float` objects will not work out of the box. See :meth:`Kernel.set_scalar_arg_dtypes` for a way to make them work. Alternatively, the standard library module :mod:`struct` can be used to convert Python's native number types to binary data in a :class:`str`. * An instance of :class:`MemoryObject`. (e.g. :class:`Buffer`, :class:`Image`, etc.) * An instance of :class:`LocalMemory`. * An instance of :class:`Sampler`. * An instance of :class:`CommandQueue`. (CL 2.0 and higher only) .. method:: set_args(self, *args) Invoke :meth:`set_arg` on each element of *args* in turn. .. versionadded:: 0.92 .. method:: set_scalar_arg_dtypes(arg_dtypes) Inform the wrapper about the sized types of scalar :class:`Kernel` arguments. For each argument, *arg_dtypes* contains an entry. For non-scalars, this must be *None*. For scalars, it must be an object acceptable to the :class:`numpy.dtype` constructor, indicating that the corresponding scalar argument is of that type. After invoking this function with the proper information, most suitable number types will automatically be cast to the right type for kernel invocation. .. note :: The information set by this method is attached to a single kernel instance. A new kernel instance is created every time you use `program.kernel` attribute access. The following will therefore not work:: prg = cl.Program(...).build() prg.kernel.set_scalar_arg_dtypes(...) prg.kernel(queue, n_globals, None, args) .. method:: __call__(queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False) Use :func:`enqueue_nd_range_kernel` to enqueue a kernel execution, after using :meth:`set_args` to set each argument in turn. See the documentation for :meth:`set_arg` to see what argument types are allowed. |glsize| |empty-nd-range| |std-enqueue-blurb| .. note:: :meth:`__call__` is *not* thread-safe. It sets the arguments using :meth:`set_args` and then runs :func:`enqueue_nd_range_kernel`. Another thread could race it in doing the same things, with undefined outcome. This issue is inherited from the C-level OpenCL API. The recommended solution is to make a kernel (i.e. 
access ``prg.kernel_name``, which corresponds to making a new kernel) for every thread that may enqueue calls to the kernel. A solution involving implicit locks was discussed and decided against on the mailing list in `October 2012 `__. .. versionchanged:: 0.92 *local_size* was promoted to third positional argument from being a keyword argument. The old keyword argument usage will continue to be accepted with a warning throughout the 0.92 release cycle. This is a backward-compatible change (just barely!) because *local_size* as third positional argument can only be a :class:`tuple` or *None*. :class:`tuple` instances are never valid :class:`Kernel` arguments, and *None* is valid as an argument, but its treatment in the wrapper had a bug (now fixed) that prevented it from working. .. versionchanged:: 2011.1 Added the *g_times_l* keyword arg. .. versionchanged:: 2020.2 Added the *allow_empty_ndrange* keyword argument. .. method:: capture_call(output_file, queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False) This method supports the exact same interface as :meth:`__call__`, but instead of invoking the kernel, it writes a self-contained PyOpenCL program to *output_file* that reproduces this invocation. Data and kernel source code will be packaged up in *output_file*'s source code. This is mainly intended as a debugging aid. For example, it can be used to automate the task of creating a small, self-contained test case for an observed problem. It can also help separate a misbehaving kernel from a potentially large or time-consuming outer code. :arg output_file: a filename or a file-like to which the generated code is to be written. To use, simply change:: evt = my_kernel(queue, gsize, lsize, arg1, arg2, ...) to:: evt = my_kernel.capture_call("bug.py", queue, gsize, lsize, arg1, arg2, ...) .. versionadded:: 2013.1 .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| .. class:: LocalMemory(size) A helper class to pass ``__local`` memory arguments to kernels. .. versionadded:: 0.91.2 .. attribute:: size The size of the local buffer in bytes to be provided. .. function:: enqueue_nd_range_kernel(queue, kernel, global_work_size, local_work_size, global_work_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False) |glsize| |empty-nd-range| |std-enqueue-blurb| .. versionchanged:: 2011.1 Added the *g_times_l* keyword arg. .. versionchanged:: 2020.2 Added the *allow_empty_ndrange* keyword argument. pyopencl-2025.1/doc/runtime_queue.rst0000644000000000000000000001215714332717401014560 0ustar00.. include:: subst.rst OpenCL Runtime: Command Queues and Events ========================================= .. currentmodule:: pyopencl Command Queue ------------- .. class:: CommandQueue(context, device=None, properties=None) Create a new command queue. *properties* is a bit field consisting of :class:`command_queue_properties` values. If *device* is None, one of the devices in *context* is chosen in an implementation-defined manner. *properties* may be a bitwise combination of values from :class:`queue_properties` (or *None* which is equivalent to passing *0*). This is compatible with both OpenCL 1.x and 2.x. For OpenCL 2.0 and above, *properties* may also be a sequence of keys and values from :class:`queue_properties` as accepted by :c:func:`clCreateCommandQueueWithProperties` (see the OpenCL spec for details). The trailing *0* is added automatically and does not need to be included.
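For instance, the bit-field form can be used to create a profiling-enabled queue, after which per-event timestamps become available (a minimal sketch; the host array and buffer are placeholders)::

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    a = np.zeros(1024, dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, a.nbytes)

    evt = cl.enqueue_copy(queue, buf, a)
    evt.wait()
    print("transfer took", evt.profile.end - evt.profile.start, "ns")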
A :class:`CommandQueue` may be used as a context manager, like this:: with cl.CommandQueue(self.cl_context) as queue: enqueue_stuff(queue, ...) :meth:`finish` is automatically called at the end of the ``with``-delimited context, and further operations on the queue are considered an error. .. versionadded:: 2013.1 Context manager capability. .. versionchanged:: 2018.2 Added the sequence-of-properties interface for OpenCL 2. .. versionchanged:: 2022.1.4 Use of a command queue after its context manager completes is now considered an error. :mod:`pyopencl` will warn about this for a transitionary period and will start raising an exception in 2023. .. attribute:: info Lower case versions of the :class:`command_queue_info` constants may be used as attributes on instances of this class to directly query info attributes. .. method:: get_info(param) See :class:`command_queue_info` for values of *param*. .. method:: set_property(prop, enable) See :class:`command_queue_properties` for possible values of *prop*. *enable* is a :class:`bool`. Unavailable in OpenCL 1.1 and newer. .. method:: flush() .. method:: finish() .. automethod:: from_int_ptr .. autoattribute:: int_ptr |comparable| Event ----- .. class:: Event .. attribute:: info Lower case versions of the :class:`event_info` constants may be used as attributes on instances of this class to directly query info attributes. .. attribute:: profile An instance of :class:`ProfilingInfoGetter`. .. method:: get_info(param) See :class:`event_info` for values of *param*. .. method:: get_profiling_info(param) See :class:`profiling_info` for values of *param*. See :attr:`profile` for an easier way of obtaining the same information. .. method:: wait() .. automethod:: from_int_ptr .. autoattribute:: int_ptr .. method:: set_callback(type, cb) Add the callback *cb* with signature ``cb(status)`` to the callback queue for the event status *type* (one of the values of :class:`command_execution_status`, except :attr:`command_execution_status.QUEUED`). See the OpenCL specification for restrictions on what *cb* may and may not do. .. versionadded:: 2015.2 |comparable| .. class:: ProfilingInfoGetter .. attribute:: info Lower case versions of the :class:`profiling_info` constants may be used as attributes on the attribute ``profile`` of this class to directly query profiling info. For example, you may use *evt.profile.end* instead of *evt.get_profiling_info(pyopencl.profiling_info.END)*. Event Subclasses ---------------- .. class:: UserEvent(context) A subclass of :class:`Event`. Only available with OpenCL 1.1 and newer. .. versionadded:: 0.92 .. method:: set_status(status) See :class:`command_execution_status` for possible values of *status*. .. class:: NannyEvent Transfers between host and device return events of this type. They hold a reference to the host-side buffer and wait for the transfer to complete when they are freed. Therefore, they can safely release the reference to the object they're guarding upon destruction. A subclass of :class:`Event`. .. versionadded:: 2011.2 .. method:: get_ward() .. method:: wait() In addition to performing the same wait as :meth:`Event.wait()`, this method also releases the reference to the guarded object. Synchronization Functions ------------------------- .. function:: wait_for_events(events) .. function:: enqueue_barrier(queue, wait_for=None) Enqueues a barrier operation. which ensures that all queued commands in command_queue have finished execution. This command is a synchronization point. .. versionadded:: 0.91.5 .. 
versionchanged:: 2011.2 Takes *wait_for* and returns an :class:`Event` .. function:: enqueue_marker(queue, wait_for=None) Returns an :class:`Event`. .. versionchanged:: 2011.2 Takes *wait_for*. pyopencl-2025.1/doc/subst.rst0000644000000000000000000000430414332717401013024 0ustar00.. |comparable| replace:: Instances of this class are hashable, and two instances of this class may be compared using *"=="* and *"!="*. (Hashability was added in version 2011.2.) Two objects are considered the same if the underlying OpenCL object is the same, as established by C pointer equality. .. |buf-iface| replace:: must implement the Python buffer interface. (e.g. by being an :class:`numpy.ndarray`) .. |explain-waitfor| replace:: *wait_for* may either be *None* or a list of :class:`pyopencl.Event` instances for whose completion this command waits before starting execution. .. |std-enqueue-blurb| replace:: Returns a new :class:`pyopencl.Event`. |explain-waitfor| .. |copy-depr| replace:: **Note:** This function is deprecated as of PyOpenCL 2011.1. Use :func:`~pyopencl.enqueue_copy` instead. .. |glsize| replace:: *global_size* and *local_size* are tuples of identical length, with between one and three entries. *global_size* specifies the overall size of the computational grid: one work item will be launched for every integer point in the grid. *local_size* specifies the workgroup size, which must evenly divide the *global_size* in a dimension-by-dimension manner. *None* may be passed for local_size, in which case the implementation will use an implementation-defined workgroup size. If *g_times_l* is *True*, the global size will be multiplied by the local size. (which makes the behavior more like Nvidia CUDA) In this case, *global_size* and *local_size* also do not have to have the same number of entries. .. |empty-nd-range| replace:: *allow_empty_ndrange* is a :class:`bool` indicating how an empty NDRange is to be treated, where "empty" means that one or more entries of *global_size* or *local_size* are zero. OpenCL itself does not allow enqueueing kernels over empty NDRanges. Setting this flag to *True* enqueues a marker with a wait list (``clEnqueueMarkerWithWaitList``) to obtain the synchronization effects that would have resulted from the kernel enqueue. Setting *allow_empty_ndrange* to *True* requires OpenCL 1.2 or newer. pyopencl-2025.1/doc/tools.rst0000644000000000000000000000010614332717401013020 0ustar00Built-in Utilities ================== .. automodule:: pyopencl.tools pyopencl-2025.1/doc/types.rst0000644000000000000000000000233614332717401013033 0ustar00OpenCL Type Mapping =================== .. module:: pyopencl.cltypes .. _type-mappings: Scalar Types ------------ For ease of use, the :mod:`pyopencl.cltypes` module provides a convenient mapping from OpenCL type names to their equivalent :mod:`numpy` types. This saves you from referring back to the OpenCL spec to see that a ``cl_long`` is a 64-bit signed integer. Use the module as follows: .. doctest:: >>> import numpy as np >>> import pyopencl as cl >>> import pyopencl.cltypes >>> cl_uint = cl.cltypes.uint(42) # maps to numpy.uint32 >>> cl_long = cl.cltypes.long(1235) # maps to numpy.int64 >>> floats = np.empty((128,), dtype=cl.cltypes.float) # array of numpy.float32 .. note:: The OpenCL type ``bool`` does not have a corresponding :mod:`numpy` type defined here, because OpenCL does not specify the in-memory representation (or even the storage size) for this type.
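One common use of these mappings is declaring the scalar arguments of a kernel via :meth:`pyopencl.Kernel.set_scalar_arg_dtypes`. A minimal sketch, where the program ``prg``, its kernel ``scale``, the queue, the buffer ``a_buf``, and the length ``n`` are assumed to exist::

    # __kernel void scale(__global float *a, float factor)
    knl = prg.scale
    knl.set_scalar_arg_dtypes([None, cl.cltypes.float])
    knl(queue, (n,), None, a_buf, 2.0)  # a plain Python float now works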
Vector Types ------------ The corresponding vector types are also made available in the same package, allowing you to easily create :mod:`numpy` arrays with the appropriate memory layout. .. doctest:: >>> import numpy as np >>> array_of_float16 = np.empty((128,), dtype=cl.cltypes.float16) # array of float16 pyopencl-2025.1/examples/.gitignore0000644000000000000000000000001614332717401014167 0ustar00wiki-examples pyopencl-2025.1/examples/black-hole-accretion.py0000644000000000000000000015272014332717401016531 0ustar00#!/usr/bin/env python3 # # TrouNoir model using PyOpenCL or PyCUDA # # CC BY-NC-SA 2019 : # # Part of matrix programs from: https://forge.cbp.ens-lyon.fr/svn/bench4gpu/ # # Thanks to Andreas Klockner for PyOpenCL and PyCUDA: # http://mathema.tician.de/software/pyopencl # # Original code programmed in Fortran 77 in mars 1994 # for Practical Work of Numerical Simulation # DEA (old Master2) in astrophysics and spatial techniques in Meudon # by Herve Aussel & Emmanuel Quemener # # Conversion in C done by Emmanuel Quemener in august 1997 # GPUfication in OpenCL under Python in july 2019 # GPUfication in CUDA under Python in august 2019 # # Thanks to : # # - Herve Aussel for his part of code of black body spectrum # - Didier Pelat for his help to perform this work # - Jean-Pierre Luminet for his article published in 1979 # - Numerical Recipes for Runge Kutta recipes # - Luc Blanchet for his disponibility about my questions in General Relativity # - Pierre Lena for his passion about science and vulgarisation # If crash on OpenCL Intel implementation, add following options and force # export PYOPENCL_COMPILER_OUTPUT=1 # export CL_CONFIG_USE_VECTORIZER=True # export CL_CONFIG_CPU_VECTORIZER_MODE=16 import getopt import sys import time from socket import gethostname import numpy import pyopencl as cl def DictionariesAPI(): PhysicsList = {"Einstein": 0, "Newton": 1} return PhysicsList # # Blank space below to simplify debugging on OpenCL code # BlobOpenCL = """ #define PI (float)3.14159265359e0f #define nbr 256 #define EINSTEIN 0 #define NEWTON 1 #ifdef SETTRACKPOINTS #define TRACKPOINTS SETTRACKPOINTS #else #define TRACKPOINTS 2048 #endif float atanp(float x,float y) { float angle; angle=atan2(y,x); if (angle<0.e0f) { angle+=(float)2.e0f*PI; } return angle; } float f(float v) { return v; } #if PHYSICS == NEWTON float g(float u,float m,float b) { return (-u); } #else float g(float u,float m,float b) { return (3.e0f*m/b*pow(u,2)-u); } #endif void calcul(float *us,float *vs,float up,float vp, float h,float m,float b) { float c0,c1,c2,c3,d0,d1,d2,d3; c0=h*f(vp); c1=h*f(vp+c0/2.e0f); c2=h*f(vp+c1/2.e0f); c3=h*f(vp+c2); d0=h*g(up,m,b); d1=h*g(up+d0/2.e0f,m,b); d2=h*g(up+d1/2.e0f,m,b); d3=h*g(up+d2,m,b); *us=up+(c0+2.e0f*c1+2.e0f*c2+c3)/6.e0f; *vs=vp+(d0+2.e0f*d1+2.e0f*d2+d3)/6.e0f; } void rungekutta(float *ps,float *us,float *vs, float pp,float up,float vp, float h,float m,float b) { calcul(us,vs,up,vp,h,m,b); *ps=pp+h; } float decalage_spectral(float r,float b,float phi, float tho,float m) { return (sqrt(1-3*m/r)/(1+sqrt(m/pow(r,3))*b*sin(tho)*sin(phi))); } float spectre(float rf,int q,float b,float db, float h,float r,float m,float bss) { float flx; // flx=exp(q*log(r/m))*pow(rf,4)*b*db*h; flx=exp(q*log(r/m)+4.e0f*log(rf))*b*db*h; return(flx); } float spectre_cn(float rf32,float b32,float db32, float h32,float r32,float m32,float bss32) { #define MYFLOAT float MYFLOAT rf=(MYFLOAT)(rf32); MYFLOAT b=(MYFLOAT)(b32); MYFLOAT db=(MYFLOAT)(db32); MYFLOAT h=(MYFLOAT)(h32); MYFLOAT r=(MYFLOAT)(r32); 
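/* Added note: unlike spectre() above, which models a single
   monochromatic emission line, spectre_cn() accumulates a thermal
   (Planck-law) spectrum over the nbr frequency bins, using the locally
   computed disk temperature temp_em and the combined
   gravitational/Doppler shift rf. */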
MYFLOAT m=(MYFLOAT)(m32); MYFLOAT bss=(MYFLOAT)(bss32); MYFLOAT flx; MYFLOAT nu_rec,nu_em,qu,temp_em,flux_int; int fi,posfreq; #define planck 6.62e-34f #define k 1.38e-23f #define c2 9.e16f #define temp 3.e7f #define m_point 1.e0f #define lplanck (log(6.62e0f)-34.e0f*log(10.e0f)) #define lk (log(1.38e0f)-23.e0f*log(10.e0f)) #define lc2 (log(9.e0f)+16.e0f*log(10.e0f)) MYFLOAT v=1.e0f-3.e0f/r; qu=1.e0f/sqrt((1.e0f-3.e0f/r)*r)*(sqrt(r)-sqrt(6.e0f)+sqrt(3.e0f)/2.e0f*log((sqrt(r)+sqrt(3.e0f))/(sqrt(r)-sqrt(3.e0f))* 0.17157287525380988e0f )); // # noqa: E501 temp_em=temp*sqrt(m)*exp(0.25e0f*log(m_point)-0.75e0f*log(r)-0.125e0f*log(v)+0.25e0f*log(fabs(qu))); flux_int=0.e0f; flx=0.e0f; for (fi=0;fi0)&&(posfreqfmod(phd,PI)))&&(rps>=ri)&&(rps<=re)?1:0; } while ((rps>=rs)&&(rps<=rp0)&&(ExitOnImpact==0)&&(nh=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); } barrier(CLK_GLOBAL_MEM_FENCE); zImage[yi+sizex*xi]=zp; fImage[yi+sizex*xi]=fp; } __kernel void Circle(__global float *Trajectories,__global int *IdLast, __global float *zImage,__global float *fImage, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter ID int bi=get_global_id(0); // Integer points on circle int i=get_global_id(1); // Integer Impact Parameter Size (half of image) int bmaxi=get_global_size(0); // Integer Points on circle int imx=get_global_size(1); // Perform trajectory for each pixel float m,ri,re,tho; int q,raie; m=Mass; ri=InternalRadius; re=ExternalRadius; tho=Angle; raie=Line; float bmx,db,b,h; float phi,phd; float zp=0.e0f,fp=0.e0f; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; db=bmx/(2.e0f*(float)bmaxi); phi=2.e0f*PI/(float)imx*(float)i; phd=atanp(cos(phi)*sin(tho),cos(tho)); int yi=(int)((float)bi*sin(phi))+bmaxi; int xi=(int)((float)bi*cos(phi))+bmaxi; int HalfLap=0,ExitOnImpact=0,ni; float php,nr,r; do { php=phd+(float)HalfLap*PI; nr=php/h; ni=(int)nr; if (ni=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; barrier(CLK_GLOBAL_MEM_FENCE); } __kernel void Trajectory(__global float *Trajectories,__global int *IdLast, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter ID int bi=get_global_id(0); // Integer Impact Parameter Size (half of image) int bmaxi=get_global_size(0); // Perform trajectory for each pixel float m,rs,re; m=Mass; rs=2.e0f*m; re=ExternalRadius; float bmx,b,h; int nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; float up,vp,pp,us,vs,ps; up=0.e0f; vp=1.e0f; pp=0.e0f; nh=0; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); // b versus us float bvus=fabs(b/us); float bvus0=bvus; Trajectories[bi*TRACKPOINTS+nh]=bvus; do { nh++; pp=ps; up=us; vp=vs; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); bvus=fabs(b/us); Trajectories[bi*TRACKPOINTS+nh]=bvus; } while ((bvus>=rs)&&(bvus<=bvus0)); IdLast[bi]=nh; barrier(CLK_GLOBAL_MEM_FENCE); } __kernel void EachCircle(__global float *zImage,__global float *fImage, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter ID uint bi=(uint)get_global_id(0); // Integer Impact Parameter Size (half of image) uint 
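/* Added note: EachCircle uses one work-item per impact parameter b.
   The photon trajectory for that b is integrated once (RK4) into the
   private Trajectory[] array and then reused for every pixel on the
   image circle of radius bi around the image center. */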
bmaxi=(uint)get_global_size(0); private float Trajectory[TRACKPOINTS]; float m,rs,ri,re,tho; int raie,q; m=Mass; rs=2.e0f*m; ri=InternalRadius; re=ExternalRadius; tho=Angle; q=-2; raie=Line; float bmx,db,b,h; uint nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; db=bmx/(2.e0f*(float)bmaxi); float up,vp,pp,us,vs,ps; up=0.e0f; vp=1.e0f; pp=0.e0f; nh=0; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); // b versus us float bvus=fabs(b/us); float bvus0=bvus; Trajectory[nh]=bvus; do { nh++; pp=ps; up=us; vp=vs; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); bvus=(float)fabs(b/us); Trajectory[nh]=bvus; } while ((bvus>=rs)&&(bvus<=bvus0)); for (uint i=(uint)nh+1;i=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; } barrier(CLK_GLOBAL_MEM_FENCE); } __kernel void Original(__global float *zImage,__global float *fImage, uint Size,float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter Size (half of image) uint bmaxi=(uint)Size; float Trajectory[TRACKPOINTS]; // Perform trajectory for each pixel float m,rs,ri,re,tho; int raie,q; m=Mass; rs=2.e0f*m; ri=InternalRadius; re=ExternalRadius; tho=Angle; q=-2; raie=Line; float bmx,db,b,h; uint nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // Integer Impact Parameter ID for (int bi=0;bi=rs)&&(bvus<=bvus0)); for (uint i=(uint)nh+1;i=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; } } barrier(CLK_GLOBAL_MEM_FENCE); } """ def KernelCodeCuda(): BlobCUDA = """ #define PI (float)3.14159265359 #define nbr 256 #define EINSTEIN 0 #define NEWTON 1 #ifdef SETTRACKPOINTS #define TRACKPOINTS SETTRACKPOINTS #else #define TRACKPOINTS #endif __device__ float nothing(float x) { return(x); } __device__ float atanp(float x,float y) { float angle; angle=atan2(y,x); if (angle<0.e0f) { angle+=(float)2.e0f*PI; } return(angle); } __device__ float f(float v) { return(v); } #if PHYSICS == NEWTON __device__ float g(float u,float m,float b) { return (-u); } #else __device__ float g(float u,float m,float b) { return (3.e0f*m/b*pow(u,2)-u); } #endif __device__ void calcul(float *us,float *vs,float up,float vp, float h,float m,float b) { float c0,c1,c2,c3,d0,d1,d2,d3; c0=h*f(vp); c1=h*f(vp+c0/2.); c2=h*f(vp+c1/2.); c3=h*f(vp+c2); d0=h*g(up,m,b); d1=h*g(up+d0/2.,m,b); d2=h*g(up+d1/2.,m,b); d3=h*g(up+d2,m,b); *us=up+(c0+2.*c1+2.*c2+c3)/6.; *vs=vp+(d0+2.*d1+2.*d2+d3)/6.; } __device__ void rungekutta(float *ps,float *us,float *vs, float pp,float up,float vp, float h,float m,float b) { calcul(us,vs,up,vp,h,m,b); *ps=pp+h; } __device__ float decalage_spectral(float r,float b,float phi, float tho,float m) { return (sqrt(1-3*m/r)/(1+sqrt(m/pow(r,3))*b*sin(tho)*sin(phi))); } __device__ float spectre(float rf,int q,float b,float db, float h,float r,float m,float bss) { float flx; // flx=exp(q*log(r/m))*pow(rf,4)*b*db*h; flx=exp(q*log(r/m)+4.*log(rf))*b*db*h; return(flx); } __device__ float spectre_cn(float rf32,float b32,float db32, float h32,float r32,float m32,float bss32) { #define MYFLOAT float MYFLOAT rf=(MYFLOAT)(rf32); MYFLOAT b=(MYFLOAT)(b32); MYFLOAT db=(MYFLOAT)(db32); MYFLOAT h=(MYFLOAT)(h32); MYFLOAT r=(MYFLOAT)(r32); MYFLOAT m=(MYFLOAT)(m32); 
MYFLOAT bss=(MYFLOAT)(bss32); MYFLOAT flx; MYFLOAT nu_rec,nu_em,qu,temp_em,flux_int; int fi,posfreq; #define planck 6.62e-34 #define k 1.38e-23 #define c2 9.e16 #define temp 3.e7 #define m_point 1. #define lplanck (log(6.62)-34.*log(10.)) #define lk (log(1.38)-23.*log(10.)) #define lc2 (log(9.)+16.*log(10.)) MYFLOAT v=1.-3./r; qu=1./sqrt((1.-3./r)*r)*(sqrt(r)-sqrt(6.)+sqrt(3.)/2.*log((sqrt(r)+sqrt(3.))/(sqrt(r)-sqrt(3.))* 0.17157287525380988 )); // # noqa: #051 temp_em=temp*sqrt(m)*exp(0.25*log(m_point)-0.75*log(r)-0.125*log(v)+0.25*log(fabs(qu))); flux_int=0.; flx=0.; for (fi=0;fi0)&&(posfreqfmod(phd,PI)))&&(rps>ri)&&(rps=rs)&&(rps<=rp0)&&(ExitOnImpact==0)); if (ExitOnImpact==1) { impact(phi,rpp,b,tho,m,&zp,&fp,q,db,h,raie); } else { zp=0.e0f; fp=0.e0f; } __syncthreads(); zImage[yi+sizex*xi]=(float)zp; fImage[yi+sizex*xi]=(float)fp; } __global__ void Pixel(float *zImage,float *fImage, float *Trajectories,int *IdLast, uint ImpactParameter, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { uint xi=(uint)(blockIdx.x*blockDim.x+threadIdx.x); uint yi=(uint)(blockIdx.y*blockDim.y+threadIdx.y); uint sizex=(uint)gridDim.x*blockDim.x; uint sizey=(uint)gridDim.y*blockDim.y; // Perform trajectory for each pixel float m,ri,re,tho; int q,raie; m=Mass; ri=InternalRadius; re=ExternalRadius; tho=Angle; q=-2; raie=Line; float bmx,db,b,h; float phi,phd,php,nr,r; float zp=0,fp=0; // Autosize for image, 25% greater than external radius bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // Step of Impact Parameter db=bmx/(2.e0f*(float)ImpactParameter); // set origin as center of image float x=(float)xi-(float)(sizex/2)+(float)5e-1f; float y=(float)yi-(float)(sizey/2)+(float)5e-1f; // angle extracted from cylindric symmetry phi=atanp(x,y); phd=atanp(cos(phi)*sin(tho),cos(tho)); // Real Impact Parameter b=sqrt(x*x+y*y)*bmx/(float)ImpactParameter; // Integer Impact Parameter uint bi=(uint)sqrt(x*x+y*y); int HalfLap=0,ExitOnImpact=0,ni; if (bi=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); } zImage[yi+sizex*xi]=zp; fImage[yi+sizex*xi]=fp; } __global__ void Circle(float *Trajectories,int *IdLast, float *zImage,float *fImage, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter ID int bi=blockIdx.x*blockDim.x+threadIdx.x; // Integer points on circle int i=blockIdx.y*blockDim.y+threadIdx.y; // Integer Impact Parameter Size (half of image) int bmaxi=gridDim.x*blockDim.x; // Integer Points on circle int imx=gridDim.y*blockDim.y; // Perform trajectory for each pixel float m,ri,re,tho; int q,raie; m=Mass; ri=InternalRadius; re=ExternalRadius; tho=Angle; raie=Line; float bmx,db,b,h; float phi,phd; float zp=0,fp=0; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; db=bmx/(2.e0f*(float)bmaxi); phi=2.e0f*PI/(float)imx*(float)i; phd=atanp(cos(phi)*sin(tho),cos(tho)); int yi=(int)((float)bi*sin(phi))+bmaxi; int xi=(int)((float)bi*cos(phi))+bmaxi; int HalfLap=0,ExitOnImpact=0,ni; float php,nr,r; do { php=phd+(float)HalfLap*PI; nr=php/h; ni=(int)nr; if (ni=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; } __global__ void Trajectory(float *Trajectories,int *IdLast, float Mass,float InternalRadius, float ExternalRadius,float 
Angle, int Line) { // Integer Impact Parameter ID int bi=blockIdx.x*blockDim.x+threadIdx.x; // Integer Impact Parameter Size (half of image) int bmaxi=gridDim.x*blockDim.x; // Perform trajectory for each pixel float m,rs,re; m=Mass; rs=2.e0f*m; re=ExternalRadius; float bmx,b,h; int nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; float up,vp,pp,us,vs,ps; up=0.e0f; vp=1.e0f; pp=0.e0f; nh=0; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); // b versus us float bvus=fabs(b/us); float bvus0=bvus; Trajectories[bi*TRACKPOINTS+nh]=bvus; do { nh++; pp=ps; up=us; vp=vs; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); bvus=fabs(b/us); Trajectories[bi*TRACKPOINTS+nh]=bvus; } while ((bvus>=rs)&&(bvus<=bvus0)); IdLast[bi]=nh; } __global__ void EachCircle(float *zImage,float *fImage, float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter ID int bi=blockIdx.x*blockDim.x+threadIdx.x; // Integer Impact Parameter Size (half of image) int bmaxi=gridDim.x*blockDim.x; float Trajectory[2048]; // Perform trajectory for each pixel float m,rs,ri,re,tho; int raie,q; m=Mass; rs=2.*m; ri=InternalRadius; re=ExternalRadius; tho=Angle; q=-2; raie=Line; float bmx,db,b,h; int nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // impact parameter b=(float)bi/(float)bmaxi*bmx; db=bmx/(2.e0f*(float)bmaxi); float up,vp,pp,us,vs,ps; up=0.; vp=1.; pp=0.; nh=0; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); // b versus us float bvus=fabs(b/us); float bvus0=bvus; Trajectory[nh]=bvus; do { nh++; pp=ps; up=us; vp=vs; rungekutta(&ps,&us,&vs,pp,up,vp,h,m,b); bvus=fabs(b/us); Trajectory[nh]=bvus; } while ((bvus>=rs)&&(bvus<=bvus0)); int imx=(int)(16*bi); for (int i=0;i=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); __syncthreads(); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; } } __global__ void Original(float *zImage,float *fImage, uint Size,float Mass,float InternalRadius, float ExternalRadius,float Angle, int Line) { // Integer Impact Parameter Size (half of image) uint bmaxi=(uint)Size; float Trajectory[TRACKPOINTS]; // Perform trajectory for each pixel float m,rs,ri,re,tho; int raie,q; m=Mass; rs=2.e0f*m; ri=InternalRadius; re=ExternalRadius; tho=Angle; q=-2; raie=Line; float bmx,db,b,h; int nh; // Autosize for image bmx=1.25e0f*re; // Angular step of integration h=4.e0f*PI/(float)TRACKPOINTS; // Integer Impact Parameter ID for (int bi=0;bi=rs)&&(bvus<=bvus0)); for (uint i=(uint)nh+1;i=ri)) { ExitOnImpact=1; impact(phi,r,b,tho,m,&zp,&fp,q,db,h,raie); } HalfLap++; } while ((HalfLap<=2)&&(ExitOnImpact==0)); zImage[yi+2*bmaxi*xi]=zp; fImage[yi+2*bmaxi*xi]=fp; } } } """ return BlobCUDA # def ImageOutput(sigma,prefix,Colors): # import matplotlib.pyplot as plt # start_time=time.time() # if Colors == 'Red2Yellow': # plt.imsave("%s.png" % prefix, sigma, cmap='afmhot') # else: # plt.imsave("%s.png" % prefix, sigma, cmap='Greys_r') # save_time = time.time()-start_time # print("Save image as %s.png file" % prefix) # print("Save Time : %f" % save_time) def ImageOutput(sigma, prefix, Colors): from PIL import Image Max = sigma.max() Min = sigma.min() # Normalize value as 8bits Integer SigmaInt = (255 * (sigma - Min) / (Max - Min)).astype("uint8") image = Image.fromarray(SigmaInt) image.save("%s.jpg" % prefix) def BlackHoleCL(zImage, fImage, InputCL): Device = InputCL["Device"] 
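# InputCL is a plain dict that bundles every runtime parameter of the run; the remaining fields the OpenCL path needs are unpacked below.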
Mass = InputCL["Mass"] InternalRadius = InputCL["InternalRadius"] ExternalRadius = InputCL["ExternalRadius"] Angle = InputCL["Angle"] Method = InputCL["Method"] TrackPoints = InputCL["TrackPoints"] Physics = InputCL["Physics"] NoImage = InputCL["NoImage"] TrackSave = InputCL["TrackSave"] PhysicsList = DictionariesAPI() if InputCL["BlackBody"]: # Spectrum is Black Body one Line = 0 else: # Spectrum is Monochromatic Line one Line = 1 Trajectories = numpy.zeros( (int(InputCL["Size"] / 2), InputCL["TrackPoints"]), dtype=numpy.float32 ) IdLast = numpy.zeros(int(InputCL["Size"] / 2), dtype=numpy.int32) # Je detecte un peripherique GPU dans la liste des peripheriques Id = 0 HasXPU = False for platform in cl.get_platforms(): for device in platform.get_devices(): if Id == Device: PF4XPU = platform.name XPU = device print("CPU/GPU selected: ", device.name.lstrip()) HasXPU = True Id += 1 if not HasXPU: print("No XPU #%i found in all of %i devices, sorry..." % (Device, Id - 1)) sys.exit() ctx = cl.Context([XPU]) queue = cl.CommandQueue( ctx, properties=cl.command_queue_properties.PROFILING_ENABLE ) BuildOptions = "-DPHYSICS=%i -DSETTRACKPOINTS=%i " % ( PhysicsList[Physics], InputCL["TrackPoints"], ) print("My Platform is ", PF4XPU) if ( "Intel" in PF4XPU or "Experimental" in PF4XPU or "Clover" in PF4XPU or "Portable" in PF4XPU ): print("No extra options for Intel and Clover!") else: BuildOptions = BuildOptions + " -cl-mad-enable" BlackHoleCL = cl.Program(ctx, BlobOpenCL).build(options=BuildOptions) # Je recupere les flag possibles pour les buffers mf = cl.mem_flags if Method == "TrajectoPixel" or Method == "TrajectoCircle": TrajectoriesCL = cl.Buffer( ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=Trajectories ) IdLastCL = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=IdLast) zImageCL = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=zImage) fImageCL = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=fImage) start_time = time.time() if Method == "EachPixel": CLLaunch = BlackHoleCL.EachPixel( queue, (zImage.shape[0], zImage.shape[1]), None, zImageCL, fImageCL, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch.wait() elif Method == "Original": CLLaunch = BlackHoleCL.Original( queue, (1,), None, zImageCL, fImageCL, numpy.uint32(zImage.shape[0] / 2), numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch.wait() elif Method == "EachCircle": CLLaunch = BlackHoleCL.EachCircle( queue, (int(zImage.shape[0] / 2),), None, zImageCL, fImageCL, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch.wait() elif Method == "TrajectoCircle": CLLaunch = BlackHoleCL.Trajectory( queue, (Trajectories.shape[0],), None, TrajectoriesCL, IdLastCL, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch = BlackHoleCL.Circle( queue, (Trajectories.shape[0], int(zImage.shape[0] * 4)), None, TrajectoriesCL, IdLastCL, zImageCL, fImageCL, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch.wait() else: CLLaunch = BlackHoleCL.Trajectory( queue, (Trajectories.shape[0],), None, TrajectoriesCL, IdLastCL, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), 
numpy.int32(Line), ) CLLaunch = BlackHoleCL.Pixel( queue, (zImage.shape[0], zImage.shape[1]), None, zImageCL, fImageCL, TrajectoriesCL, IdLastCL, numpy.uint32(Trajectories.shape[0]), numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), ) CLLaunch.wait() compute = time.time() - start_time cl.enqueue_copy(queue, zImage, zImageCL).wait() cl.enqueue_copy(queue, fImage, fImageCL).wait() if Method == "TrajectoPixel" or Method == "TrajectoCircle": cl.enqueue_copy(queue, Trajectories, TrajectoriesCL).wait() cl.enqueue_copy(queue, IdLast, IdLastCL).wait() elapsed = time.time() - start_time print("\nCompute Time : %f" % compute) print("Elapsed Time : %f\n" % elapsed) zMaxPosition = numpy.where(zImage[:, :] == zImage.max()) fMaxPosition = numpy.where(fImage[:, :] == fImage.max()) print( "Z max @(%f,%f) : %f" % ( ( 1.0 * zMaxPosition[1][0] / zImage.shape[1] - 0.5, 1.0 * zMaxPosition[0][0] / zImage.shape[0] - 0.5, zImage.max(), ) ) ) print( "Flux max @(%f,%f) : %f" % ( ( 1.0 * fMaxPosition[1][0] / fImage.shape[1] - 0.5, 1.0 * fMaxPosition[0][0] / fImage.shape[0] - 0.5, fImage.max(), ) ) ) zImageCL.release() fImageCL.release() if Method == "TrajectoPixel" or Method == "TrajectoCircle": if not NoImage: AngleStep = 4 * numpy.pi / TrackPoints Angles = numpy.arange(0.0, 4 * numpy.pi, AngleStep) Angles.shape = (1, TrackPoints) if TrackSave: # numpy.savetxt("TrouNoirTrajectories_%s.csv" % ImageInfo, # numpy.transpose(numpy.concatenate((Angles,Trajectories),axis=0)), # delimiter=' ', fmt='%.2e') numpy.savetxt( "TrouNoirTrajectories.csv", numpy.transpose(numpy.concatenate((Angles, Trajectories), axis=0)), delimiter=" ", fmt="%.2e", ) TrajectoriesCL.release() IdLastCL.release() return elapsed def BlackHoleCUDA(zImage, fImage, InputCL): Device = InputCL["Device"] Mass = InputCL["Mass"] InternalRadius = InputCL["InternalRadius"] ExternalRadius = InputCL["ExternalRadius"] Angle = InputCL["Angle"] Method = InputCL["Method"] TrackPoints = InputCL["TrackPoints"] Physics = InputCL["Physics"] Threads = InputCL["Threads"] NoImage = InputCL["NoImage"] PhysicsList = DictionariesAPI() if InputCL["BlackBody"]: # Spectrum is Black Body one Line = 0 else: # Spectrum is Monochromatic Line one Line = 1 Trajectories = numpy.zeros( (int(InputCL["Size"] / 2), InputCL["TrackPoints"]), dtype=numpy.float32 ) IdLast = numpy.zeros(int(InputCL["Size"] / 2), dtype=numpy.int32) try: # For PyCUDA import import pycuda.driver as cuda from pycuda.compiler import SourceModule cuda.init() for Id in range(cuda.Device.count()): if Id == Device: XPU = cuda.Device(Id) print("GPU selected %s" % XPU.name()) print() except ImportError: print("Platform does not seem to support CUDA") Context = XPU.make_context() try: mod = SourceModule( KernelCodeCuda(), options=[ "--compiler-options", "-DPHYSICS=%i -DSETTRACKPOINTS=%i" % (PhysicsList[Physics], TrackPoints), ], ) print("Compilation seems to be OK") except Exception: print("Compilation seems to break") EachPixelCU = mod.get_function("EachPixel") OriginalCU = mod.get_function("Original") EachCircleCU = mod.get_function("EachCircle") TrajectoryCU = mod.get_function("Trajectory") PixelCU = mod.get_function("Pixel") CircleCU = mod.get_function("Circle") TrajectoriesCU = cuda.mem_alloc(Trajectories.size * Trajectories.dtype.itemsize) cuda.memcpy_htod(TrajectoriesCU, Trajectories) zImageCU = cuda.mem_alloc(zImage.size * zImage.dtype.itemsize) cuda.memcpy_htod(zImageCU, zImage) fImageCU = cuda.mem_alloc(fImage.size * fImage.dtype.itemsize) cuda.memcpy_htod(fImageCU,
fImage) IdLastCU = cuda.mem_alloc(IdLast.size * IdLast.dtype.itemsize) cuda.memcpy_htod(IdLastCU, IdLast) start_time = time.time() if Method == "EachPixel": EachPixelCU( zImageCU, fImageCU, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(int(zImage.shape[0] / Threads), int(zImage.shape[1] / Threads)), block=(Threads, Threads, 1), ) elif Method == "EachCircle": EachCircleCU( zImageCU, fImageCU, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(int(zImage.shape[0] / Threads / 2), 1), block=(Threads, 1, 1), ) elif Method == "Original": OriginalCU( zImageCU, fImageCU, numpy.uint32(zImage.shape[0] / 2), numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(1, 1), block=(1, 1, 1), ) elif Method == "TrajectoCircle": TrajectoryCU( TrajectoriesCU, IdLastCU, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(int(Trajectories.shape[0] / Threads), 1), block=(Threads, 1, 1), ) CircleCU( TrajectoriesCU, IdLastCU, zImageCU, fImageCU, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=( int(Trajectories.shape[0] / Threads), int(zImage.shape[0] * 4 / Threads), ), block=(Threads, Threads, 1), ) else: # Default method: TrajectoPixel TrajectoryCU( TrajectoriesCU, IdLastCU, numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(int(Trajectories.shape[0] / Threads), 1), block=(Threads, 1, 1), ) PixelCU( zImageCU, fImageCU, TrajectoriesCU, IdLastCU, numpy.uint32(Trajectories.shape[0]), numpy.float32(Mass), numpy.float32(InternalRadius), numpy.float32(ExternalRadius), numpy.float32(Angle), numpy.int32(Line), grid=(int(zImage.shape[0] / Threads), int(zImage.shape[1] / Threads), 1), block=(Threads, Threads, 1), ) Context.synchronize() compute = time.time() - start_time cuda.memcpy_dtoh(zImage, zImageCU) cuda.memcpy_dtoh(fImage, fImageCU) if Method == "TrajectoPixel" or Method == "TrajectoCircle": cuda.memcpy_dtoh(Trajectories, TrajectoriesCU) elapsed = time.time() - start_time print("\nCompute Time : %f" % compute) print("Elapsed Time : %f\n" % elapsed) zMaxPosition = numpy.where(zImage[:, :] == zImage.max()) fMaxPosition = numpy.where(fImage[:, :] == fImage.max()) print( "Z max @(%f,%f) : %f" % ( ( 1.0 * zMaxPosition[1][0] / zImage.shape[1] - 0.5, 1.0 * zMaxPosition[0][0] / zImage.shape[0] - 0.5, zImage.max(), ) ) ) print( "Flux max @(%f,%f) : %f" % ( ( 1.0 * fMaxPosition[1][0] / fImage.shape[1] - 0.5, 1.0 * fMaxPosition[0][0] / fImage.shape[0] - 0.5, fImage.max(), ) ) ) Context.pop() Context.detach() if Method == "TrajectoPixel" or Method == "TrajectoCircle": if not NoImage: AngleStep = 4 * numpy.pi / TrackPoints Angles = numpy.arange(0.0, 4 * numpy.pi, AngleStep) Angles.shape = (1, TrackPoints) # numpy.savetxt("TrouNoirTrajectories_%s.csv" % ImageInfo, # numpy.transpose(numpy.concatenate((Angles,Trajectories),axis=0)), # delimiter=' ', fmt='%.2e') numpy.savetxt( "TrouNoirTrajectories.csv", numpy.transpose(numpy.concatenate((Angles, Trajectories), axis=0)), delimiter=" ", fmt="%.2e", ) return elapsed if __name__ == "__main__": # Default device: first one! Device = 0 # Default implementation: OpenCL, most versatile! 
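# A typical invocation (illustrative values only; substitute the name this
# script is saved under) might be:
#   python <this-script>.py -g OpenCL -d 0 -s 1024 -o TrajectoPixel -p Einstein
# Every default below can be overridden via the getopt options parsed further down.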
GpuStyle = "OpenCL" Mass = 1.0 # Internal Radius 3 times de Schwarzschild Radius InternalRadius = 6.0 * Mass # ExternalRadius = 12.0 # # Angle with normal to disc 10 degrees Angle = numpy.pi / 180.0 * (90.0 - 10.0) # Radiation of disc : BlackBody or Monochromatic BlackBody = False # Size of image Size = 1024 # Variable Type VariableType = "FP32" # ? q = -2 # Method of resolution Method = "TrajectoPixel" # Colors for output image Colors = "Greyscale" # Physics Physics = "Einstein" # No output as image NoImage = False # Threads in CUDA Threads = 32 # Trackpoints of trajectories TrackPoints = 2048 # Tracksave of trajectories TrackSave = False HowToUse = "%s -h [Help] -b [BlackBodyEmission] -j [TrackSave] -n [NoImage] -p -s -m -i -x -a -d -c -g -o -t -v -k " try: opts, args = getopt.getopt( sys.argv[1:], "hbnjs:m:i:x:a:d:g:v:o:t:c:p:k:", [ "tracksave", "blackbody", "noimage", "camera", "size=", "mass=", "internal=", "external=", "angle=", "device=", "gpustyle=", "variabletype=", "method=", "threads=", "colors=", "physics=", "trackpoints=", ], ) except getopt.GetoptError: print(HowToUse % sys.argv[0]) sys.exit(2) # List of Devices Devices = [] Alu = {} for opt, arg in opts: if opt == "-h": print(HowToUse % sys.argv[0]) print("\nInformations about devices detected under OpenCL API:") # For PyOpenCL import try: Id = 0 for platform in cl.get_platforms(): for device in platform.get_devices(): # deviceType=cl.device_type.to_string(device.type) deviceType = "xPU" print( "Device #%i from %s of type %s : %s" % ( Id, platform.vendor.lstrip(), deviceType, device.name.lstrip(), ) ) Id = Id + 1 except Exception: print("Your platform does not seem to support OpenCL") print("\nInformations about devices detected under CUDA API:") # For PyCUDA import try: import pycuda.driver as cuda cuda.init() for Id in range(cuda.Device.count()): device = cuda.Device(Id) print("Device #%i of type GPU : %s" % (Id, device.name())) print except Exception: print("Your platform does not seem to support CUDA") sys.exit() elif opt in ("-d", "--device"): # Devices.append(int(arg)) Device = int(arg) elif opt in ("-g", "--gpustyle"): GpuStyle = arg elif opt in ("-v", "--variabletype"): VariableType = arg elif opt in ("-s", "--size"): Size = int(arg) elif opt in ("-k", "--trackpoints"): TrackPoints = int(arg) elif opt in ("-m", "--mass"): Mass = float(arg) elif opt in ("-i", "--internal"): InternalRadius = float(arg) elif opt in ("-e", "--external"): ExternalRadius = float(arg) elif opt in ("-a", "--angle"): Angle = numpy.pi / 180.0 * (90.0 - float(arg)) elif opt in ("-b", "--blackbody"): BlackBody = True elif opt in ("-j", "--tracksave"): TrackSave = True elif opt in ("-n", "--noimage"): NoImage = True elif opt in ("-o", "--method"): Method = arg elif opt in ("-t", "--threads"): Threads = int(arg) elif opt in ("-c", "--colors"): Colors = arg elif opt in ("-p", "--physics"): Physics = arg print("Device Identification selected : %s" % Device) print("GpuStyle used : %s" % GpuStyle) print("VariableType : %s" % VariableType) print("Size : %i" % Size) print("Mass : %f" % Mass) print("Internal Radius : %f" % InternalRadius) print("External Radius : %f" % ExternalRadius) print("Angle with normal of (in radians) : %f" % Angle) print("Black Body Disc Emission (monochromatic instead) : %s" % BlackBody) print("Method of resolution : %s" % Method) print("Colors for output images : %s" % Colors) print("Physics used for Trajectories : %s" % Physics) print("Trackpoints of Trajectories : %i" % TrackPoints) print("Tracksave of Trajectories : %i" % 
TrackSave) if GpuStyle == "CUDA": print("\nSelection of CUDA device") try: # For PyCUDA import import pycuda.driver as cuda cuda.init() for Id in range(cuda.Device.count()): device = cuda.Device(Id) print("Device #%i of type GPU : %s" % (Id, device.name())) if Id in Devices: Alu[Id] = "GPU" except ImportError: print("Platform does not seem to support CUDA") if GpuStyle == "OpenCL": print("\nSelection of OpenCL device") try: # For PyOpenCL import import pyopencl as cl Id = 0 for platform in cl.get_platforms(): for device in platform.get_devices(): # deviceType=cl.device_type.to_string(device.type) deviceType = "xPU" print( "Device #%i from %s of type %s : %s" % ( Id, platform.vendor.lstrip().rstrip(), deviceType, device.name.lstrip().rstrip(), ) ) if Id in Devices: # Set the Alu as detected Device Type Alu[Id] = deviceType Id = Id + 1 except ImportError: print("Platform does not seem to support OpenCL") zImage = numpy.zeros((Size, Size), dtype=numpy.float32) fImage = numpy.zeros((Size, Size), dtype=numpy.float32) InputCL = {} InputCL["Device"] = Device InputCL["GpuStyle"] = GpuStyle InputCL["VariableType"] = VariableType InputCL["Size"] = Size InputCL["Mass"] = Mass InputCL["InternalRadius"] = InternalRadius InputCL["ExternalRadius"] = ExternalRadius InputCL["Angle"] = Angle InputCL["BlackBody"] = BlackBody InputCL["Method"] = Method InputCL["TrackPoints"] = TrackPoints InputCL["Physics"] = Physics InputCL["Threads"] = Threads InputCL["NoImage"] = NoImage InputCL["TrackSave"] = TrackSave if GpuStyle == "OpenCL": duration = BlackHoleCL(zImage, fImage, InputCL) else: duration = BlackHoleCUDA(zImage, fImage, InputCL) Hostname = gethostname() Date = time.strftime("%Y%m%d_%H%M%S") ImageInfo = "%s_Device%i_%s_%s" % (Method, Device, Hostname, Date) if not NoImage: ImageOutput(zImage, "TrouNoirZ_%s" % ImageInfo, Colors) ImageOutput(fImage, "TrouNoirF_%s" % ImageInfo, Colors) pyopencl-2025.1/examples/demo-struct-reduce.py0000644000000000000000000000344014332717401016270 0ustar00import numpy as np import pyopencl as cl def make_collector_dtype(device): dtype = np.dtype([ ("cur_min", np.int32), ("cur_max", np.int32), ("pad", np.int32), ]) name = "minmax_collector" from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct dtype, c_decl = match_dtype_to_c_struct(device, name, dtype) dtype = get_or_register_dtype(name, dtype) return dtype, c_decl ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) mmc_dtype, mmc_c_decl = make_collector_dtype(ctx.devices[0]) preamble = mmc_c_decl + r"""//CL// minmax_collector mmc_neutral() { // FIXME: needs infinity literal in real use, ok here minmax_collector result; result.cur_min = 1<<30; result.cur_max = -(1<<30); return result; } minmax_collector mmc_from_scalar(float x) { minmax_collector result; result.cur_min = x; result.cur_max = x; return result; } minmax_collector agg_mmc(minmax_collector a, minmax_collector b) { minmax_collector result = a; if (b.cur_min < result.cur_min) result.cur_min = b.cur_min; if (b.cur_max > result.cur_max) result.cur_max = b.cur_max; return result; } """ from pyopencl.clrandom import rand as clrand a_gpu = clrand(queue, (20000,), dtype=np.int32, a=0, b=10**6) a = a_gpu.get() from pyopencl.reduction import ReductionKernel red = ReductionKernel(ctx, mmc_dtype, neutral="mmc_neutral()", reduce_expr="agg_mmc(a, b)", map_expr="mmc_from_scalar(x[i])", arguments="__global int *x", preamble=preamble) minmax = red(a_gpu).get() assert abs(minmax["cur_min"] - np.min(a)) < 1e-5 assert abs(minmax["cur_max"] - np.max(a)) < 
1e-5 pyopencl-2025.1/examples/demo.py0000644000000000000000000000176514332717401013511 0ustar00#!/usr/bin/env python import numpy as np import pyopencl as cl rng = np.random.default_rng() a_np = rng.random(50000, dtype=np.float32) b_np = rng.random(50000, dtype=np.float32) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) mf = cl.mem_flags a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np) b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np) prg = cl.Program(ctx, """ __kernel void sum( __global const float *a_g, __global const float *b_g, __global float *res_g) { int gid = get_global_id(0); res_g[gid] = a_g[gid] + b_g[gid]; } """).build() res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes) knl = prg.sum # Use this Kernel object for repeated calls knl(queue, a_np.shape, None, a_g, b_g, res_g) res_np = np.empty_like(a_np) cl.enqueue_copy(queue, res_np, res_g) # Check on CPU with Numpy: error_np = res_np - (a_np + b_np) print(f"Error:\n{error_np}") print(f"Norm: {np.linalg.norm(error_np):.16e}") assert np.allclose(res_np, a_np + b_np) pyopencl-2025.1/examples/demo_array.py0000644000000000000000000000150614332717401014700 0ustar00import numpy as np import numpy.linalg as la import pyopencl as cl import pyopencl.array as cl_array rng = np.random.default_rng() a = rng.random(50000, dtype=np.float32) b = rng.random(50000, dtype=np.float32) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) a_dev = cl_array.to_device(queue, a) b_dev = cl_array.to_device(queue, b) dest_dev = cl_array.empty_like(a_dev) prg = cl.Program(ctx, """ __kernel void sum(__global const float *a, __global const float *b, __global float *c) { int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } """).build() knl = prg.sum # Use this Kernel object for repeated calls knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data) print(la.norm((dest_dev - (a_dev+b_dev)).get())) assert np.allclose(dest_dev.get(), (a_dev + b_dev).get()) pyopencl-2025.1/examples/demo_array_svm.py0000644000000000000000000000165214332717401015567 0ustar00import numpy as np import pyopencl as cl import pyopencl.array as cl_array from pyopencl.tools import SVMAllocator, SVMPool n = 50000 rng = np.random.default_rng() a = rng.random(n, dtype=np.float32) b = rng.random(n, dtype=np.float32) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) alloc = SVMAllocator(ctx, alignment=0, queue=queue) alloc = SVMPool(alloc) a_dev = cl_array.to_device(queue, a, allocator=alloc) b_dev = cl_array.to_device(queue, b, allocator=alloc) dest_dev = cl_array.empty_like(a_dev) prg = cl.Program(ctx, """ __kernel void sum(__global const float *a, __global const float *b, __global float *c) { int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } """).build() knl = prg.sum knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data) print(np.linalg.norm((dest_dev - (a_dev + b_dev)).get())) assert np.allclose(dest_dev.get(), (a_dev + b_dev).get()) pyopencl-2025.1/examples/demo_elementwise.py0000644000000000000000000000146114332717401016103 0ustar00import numpy as np import pyopencl as cl import pyopencl.array from pyopencl.elementwise import ElementwiseKernel n = 10 rng = np.random.default_rng() a_np = rng.random(n, dtype=np.float32) b_np = rng.random(n, dtype=np.float32) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) a_g = cl.array.to_device(queue, a_np) b_g = cl.array.to_device(queue, b_np) lin_comb = ElementwiseKernel(ctx, "float k1, float *a_g, float k2, float *b_g, float *res_g", "res_g[i] = k1 
* a_g[i] + k2 * b_g[i]", "lin_comb") res_g = cl.array.empty_like(a_g) lin_comb(2, a_g, 3, b_g, res_g) # Check on GPU with PyOpenCL Array: print((res_g - (2 * a_g + 3 * b_g)).get()) # Check on CPU with Numpy: res_np = res_g.get() print(res_np - (2 * a_np + 3 * b_np)) print(np.linalg.norm(res_np - (2 * a_np + 3 * b_np))) pyopencl-2025.1/examples/demo_elementwise_complex.py0000644000000000000000000000276514332717401017642 0ustar00import numpy as np import numpy.linalg as la import pyopencl as cl import pyopencl.array as cl_array from pyopencl.elementwise import ElementwiseKernel ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) n = 10 rng = np.random.default_rng() a_gpu = cl_array.to_device(queue, rng.standard_normal(n, dtype=np.float32) + 1j*rng.standard_normal(n, dtype=np.float32)) b_gpu = cl_array.to_device(queue, rng.standard_normal(n, dtype=np.float32) + 1j*rng.standard_normal(n, dtype=np.float32)) complex_prod = ElementwiseKernel(ctx, "float a, " "cfloat_t *x, " "cfloat_t *y, " "cfloat_t *z", "z[i] = cfloat_rmul(a, cfloat_mul(x[i], y[i]))", "complex_prod", preamble="#include ") complex_add = ElementwiseKernel(ctx, "cfloat_t *x, " "cfloat_t *y, " "cfloat_t *z", "z[i] = cfloat_add(x[i], y[i])", "complex_add", preamble="#include ") real_part = ElementwiseKernel(ctx, "cfloat_t *x, float *z", "z[i] = cfloat_real(x[i])", "real_part", preamble="#include ") c_gpu = cl_array.empty_like(a_gpu) complex_prod(5, a_gpu, b_gpu, c_gpu) c_gpu_real = cl_array.empty(queue, len(a_gpu), dtype=np.float32) real_part(c_gpu, c_gpu_real) print(c_gpu.get().real - c_gpu_real.get()) print(la.norm(c_gpu.get() - (5*a_gpu.get()*b_gpu.get()))) assert la.norm(c_gpu.get() - (5*a_gpu.get()*b_gpu.get())) < 1e-5 pyopencl-2025.1/examples/demo_mandelbrot.py0000644000000000000000000001167314332717401015717 0ustar00# I found this example for PyCuda here: # http://wiki.tiker.net/PyCuda/Examples/Mandelbrot # # An improved sequential/pure Python code was contributed # by CRVSADER//KY . # # I adapted it for PyOpenCL. Hopefully it is useful to someone. # July 2010, HolgerRapp@gmx.net # # Original readme below these lines. 
# Mandelbrot calculate using GPU, Serial numpy and faster numpy # Use to show the speed difference between CPU and GPU calculations # ian@ianozsvald.com March 2010 # Based on vegaseat's TKinter/numpy example code from 2006 # http://www.daniweb.com/code/snippet216851.html# # with minor changes to move to numpy from the obsolete Numeric import time import numpy as np from PIL import Image import pyopencl as cl # You can choose a calculation routine below (calc_fractal), uncomment # one of the three lines to test the three variations # Speed notes are listed in the same place # set width and height of window, more pixels take longer to calculate w = 2048 h = 2048 def calc_fractal_opencl(q, maxiter): ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) output = np.empty(q.shape, dtype=np.uint16) mf = cl.mem_flags q_opencl = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=q) output_opencl = cl.Buffer(ctx, mf.WRITE_ONLY, output.nbytes) prg = cl.Program( ctx, """ #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable __kernel void mandelbrot(__global float2 *q, __global ushort *output, ushort const maxiter) { int gid = get_global_id(0); float nreal, real = 0; float imag = 0; output[gid] = 0; for(int curiter = 0; curiter < maxiter; curiter++) { nreal = real*real - imag*imag + q[gid].x; imag = 2* real*imag + q[gid].y; real = nreal; if (real*real + imag*imag > 4.0f) { output[gid] = curiter; break; } } } """, ).build() prg.mandelbrot( queue, output.shape, None, q_opencl, output_opencl, np.uint16(maxiter) ) cl.enqueue_copy(queue, output, output_opencl).wait() return output def calc_fractal_serial(q, maxiter): # calculate z using pure python on a numpy array # note that, unlike the other two implementations, # the number of iterations per point is NOT constant z = np.zeros(q.shape, complex) output = np.resize( np.array( 0, ), q.shape, ) for i in range(len(q)): for iter in range(maxiter): z[i] = z[i] * z[i] + q[i] if abs(z[i]) > 2.0: output[i] = iter break return output def calc_fractal_numpy(q, maxiter): # calculate z using numpy, this is the original # routine from vegaseat's URL output = np.resize( np.array( 0, ), q.shape, ) z = np.zeros(q.shape, np.complex64) for it in range(maxiter): z = z * z + q done = np.greater(abs(z), 2.0) q = np.where(done, 0 + 0j, q) z = np.where(done, 0 + 0j, z) output = np.where(done, it, output) return output # choose your calculation routine here by uncommenting one of the options calc_fractal = calc_fractal_opencl # calc_fractal = calc_fractal_serial # calc_fractal = calc_fractal_numpy class Mandelbrot: def draw(self, x1, x2, y1, y2, maxiter=30): # draw the Mandelbrot set, from numpy example xx = np.arange(x1, x2, (x2 - x1) / w) yy = np.arange(y2, y1, (y1 - y2) / h) * 1j q = np.ravel(xx + yy[:, np.newaxis]).astype(np.complex64) start_main = time.time() output = calc_fractal(q, maxiter) end_main = time.time() secs = end_main - start_main print("Main took", secs) self.mandel = (output.reshape((h, w)) / float(output.max()) * 255.0).astype( np.uint8 ) def create_image(self): """ " create the image from the draw() string """ # you can experiment with these x and y ranges self.draw(-2.13, 0.77, -1.3, 1.3) self.im = Image.fromarray(self.mandel) self.im.putpalette([i for rgb in ((j, 0, 0) for j in range(255)) for i in rgb]) def create_label(self): # put the image on a label widget self.image = ImageTk.PhotoImage(self.im) self.label = tk.Label(self.root, image=self.image) self.label.pack() def run_tk(self): self.root = tk.Tk() self.root.title("Mandelbrot 
Set") self.create_image() self.create_label() # start event loop self.root.mainloop() if __name__ == "__main__": test = Mandelbrot() try: import tkinter as tk except ModuleNotFoundError: test.create_image() else: from PIL import ImageTk try: test.run_tk() except tk.TclError: test.create_image() pyopencl-2025.1/examples/demo_meta_codepy.py0000644000000000000000000000331614332717401016054 0ustar00import numpy as np import numpy.linalg as la from cgen import ( POD, Assign, Block, Const, FunctionBody, FunctionDeclaration, Initializer, Module, Pointer, Value, ) from cgen.opencl import CLGlobal, CLKernel, CLRequiredWorkGroupSize import pyopencl as cl local_size = 256 thread_strides = 32 macroblock_count = 33 dtype = np.float32 total_size = local_size*thread_strides*macroblock_count ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) rng = np.random.default_rng() a = rng.standard_normal(total_size, dtype=dtype) b = rng.standard_normal(total_size, dtype=dtype) mf = cl.mem_flags a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a) b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b) c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes) mod = Module([ FunctionBody( CLKernel(CLRequiredWorkGroupSize((local_size,), FunctionDeclaration( Value("void", "add"), arg_decls=[CLGlobal(Pointer(Const(POD(dtype, name)))) for name in ["tgt", "op1", "op2"]]))), Block([ Initializer(POD(np.int32, "idx"), "get_local_id(0) + %d * get_group_id(0)" % (local_size*thread_strides)) ]+[ Assign( "tgt[idx+%d]" % (o*local_size), "op1[idx+%d] + op2[idx+%d]" % ( o*local_size, o*local_size)) for o in range(thread_strides)]))]) knl = cl.Program(ctx, str(mod)).build().add knl(queue, (local_size*macroblock_count,), (local_size,), c_buf, a_buf, b_buf) c = np.empty_like(a) cl.enqueue_copy(queue, c, c_buf).wait() assert la.norm(c-(a+b)) == 0 pyopencl-2025.1/examples/demo_meta_template.py0000644000000000000000000000270314332717401016403 0ustar00import numpy as np import numpy.linalg as la from mako.template import Template import pyopencl as cl local_size = 256 thread_strides = 32 macroblock_count = 33 dtype = np.float32 total_size = local_size*thread_strides*macroblock_count ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) rng = np.random.default_rng() a = rng.standard_normal(total_size, dtype=dtype) b = rng.standard_normal(total_size, dtype=dtype) mf = cl.mem_flags a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a) b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b) c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes) tpl = Template(""" __kernel void add( __global ${ type_name } *tgt, __global const ${ type_name } *op1, __global const ${ type_name } *op2) { int idx = get_local_id(0) + ${ local_size } * ${ thread_strides } * get_group_id(0); % for i in range(thread_strides): <% offset = i*local_size %> tgt[idx + ${ offset }] = op1[idx + ${ offset }] + op2[idx + ${ offset } ]; % endfor }""") rendered_tpl = tpl.render(type_name="float", local_size=local_size, thread_strides=thread_strides) knl = cl.Program(ctx, str(rendered_tpl)).build().add knl(queue, (local_size*macroblock_count,), (local_size,), c_buf, a_buf, b_buf) c = np.empty_like(a) cl.enqueue_copy(queue, c, c_buf).wait() assert la.norm(c-(a+b)) == 0 pyopencl-2025.1/examples/dump-performance.py0000644000000000000000000000246414332717401016026 0ustar00import pyopencl as cl import pyopencl.characterize.performance as perf def main(): ctx = cl.create_some_context() prof_overhead, latency = 
perf.get_profiling_overhead(ctx) print("command latency: %g s" % latency) print("profiling overhead: {:g} s -> {:.1f} %".format( prof_overhead, 100*prof_overhead/latency)) queue = cl.CommandQueue( ctx, properties=cl.command_queue_properties.PROFILING_ENABLE) print("empty kernel: %g s" % perf.get_empty_kernel_time(queue)) print("float32 add: %g GOps/s" % (perf.get_add_rate(queue)/1e9)) for tx_type in [ perf.HostToDeviceTransfer, perf.DeviceToHostTransfer, perf.DeviceToDeviceTransfer]: print("----------------------------------------") print(tx_type.__name__) print("----------------------------------------") print("latency: %g s" % perf.transfer_latency(queue, tx_type)) for i in range(6, 31, 2): bs = 1 << i try: result = "%g GB/s" % ( perf.transfer_bandwidth(queue, tx_type, bs)/1e9) except Exception as e: result = "exception: %s" % e.__class__.__name__ print("bandwidth @ %d bytes: %s" % (bs, result)) if __name__ == "__main__": main() pyopencl-2025.1/examples/dump-properties.py0000644000000000000000000000632614332717401015722 0ustar00from optparse import OptionParser import pyopencl as cl parser = OptionParser() parser.add_option("-s", "--short", action="store_true", help="don't print all device properties") (options, args) = parser.parse_args() def print_info(obj, info_cls): for info_name in sorted(dir(info_cls)): if not info_name.startswith("_") and info_name != "to_string": info = getattr(info_cls, info_name) try: info_value = obj.get_info(info) except Exception: info_value = "" if (info_cls == cl.device_info and info_name == "PARTITION_TYPES_EXT" and isinstance(info_value, list)): print("{}: {}".format(info_name, [ cl.device_partition_property_ext.to_string(v, "") for v in info_value])) else: try: print(f"{info_name}: {info_value}") except Exception: print("%s: " % info_name) for platform in cl.get_platforms(): print(75*"=") print(platform) print(75*"=") if not options.short: print_info(platform, cl.platform_info) for device in platform.get_devices(): if not options.short: print(75*"-") print(device) if not options.short: print(75*"-") print_info(device, cl.device_info) ctx = cl.Context([device]) for mf in [ cl.mem_flags.READ_ONLY, # cl.mem_flags.READ_WRITE, # cl.mem_flags.WRITE_ONLY ]: for itype in [ cl.mem_object_type.IMAGE2D, cl.mem_object_type.IMAGE3D ]: try: formats = cl.get_supported_image_formats(ctx, mf, itype) except Exception: formats = "" else: def str_chd_type(chdtype): result = cl.channel_type.to_string(chdtype, "") result = result.replace("_INT", "") result = result.replace("UNSIGNED", "U") result = result.replace("SIGNED", "S") result = result.replace("NORM", "N") result = result.replace("FLOAT", "F") return result formats = ", ".join( "{}-{}".format( cl.channel_order.to_string(iform.channel_order, ""), str_chd_type(iform.channel_data_type)) for iform in formats) print("{} {} FORMATS: {}\n".format( cl.mem_object_type.to_string(itype), cl.mem_flags.to_string(mf), formats)) del ctx pyopencl-2025.1/examples/gl_interop_demo.py0000644000000000000000000000460314332717401015725 0ustar00from OpenGL.GL import * from OpenGL.GLUT import * from OpenGL.raw.GL.VERSION.GL_1_5 import glBufferData as rawGlBufferData import pyopencl as cl n_vertices = 10000 src = """ __kernel void generate_sin(__global float2* a) { int id = get_global_id(0); int n = get_global_size(0); float r = (float)id / (float)n; float x = r * 16.0f * 3.1415f; a[id].x = r * 2.0f - 1.0f; a[id].y = native_sin(x); } """ def initialize(): platform = cl.get_platforms()[0] import sys from pyopencl.tools import 
get_gl_sharing_context_properties if sys.platform == "darwin": ctx = cl.Context(properties=get_gl_sharing_context_properties(), devices=[]) else: # Some OSs prefer clCreateContextFromType, some prefer # clCreateContext. Try both. try: ctx = cl.Context(properties=[ (cl.context_properties.PLATFORM, platform)] + get_gl_sharing_context_properties()) except: ctx = cl.Context(properties=[ (cl.context_properties.PLATFORM, platform)] + get_gl_sharing_context_properties(), devices = [platform.get_devices()[0]]) glClearColor(1, 1, 1, 1) glColor(0, 0, 1) vbo = glGenBuffers(1) glBindBuffer(GL_ARRAY_BUFFER, vbo) rawGlBufferData(GL_ARRAY_BUFFER, n_vertices * 2 * 4, None, GL_STATIC_DRAW) glEnableClientState(GL_VERTEX_ARRAY) glVertexPointer(2, GL_FLOAT, 0, None) coords_dev = cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(vbo)) prog = cl.Program(ctx, src).build() queue = cl.CommandQueue(ctx) cl.enqueue_acquire_gl_objects(queue, [coords_dev]) prog.generate_sin(queue, (n_vertices,), None, coords_dev) cl.enqueue_release_gl_objects(queue, [coords_dev]) queue.finish() glFlush() def display(): glClear(GL_COLOR_BUFFER_BIT) glDrawArrays(GL_LINE_STRIP, 0, n_vertices) glFlush() def reshape(w, h): glViewport(0, 0, w, h) glMatrixMode(GL_PROJECTION) glLoadIdentity() glMatrixMode(GL_MODELVIEW) if __name__ == '__main__': import sys glutInit(sys.argv) if len(sys.argv) > 1: n_vertices = int(sys.argv[1]) glutInitWindowSize(800, 160) glutInitWindowPosition(0, 0) glutCreateWindow('OpenCL/OpenGL Interop Tutorial: Sin Generator') glutDisplayFunc(display) glutReshapeFunc(reshape) initialize() glutMainLoop() pyopencl-2025.1/examples/gl_particle_animation.py0000644000000000000000000001460614332717401017107 0ustar00# Visualization of particles with gravity # Source: http://enja.org/2010/08/27/adventures-in-opencl-part-2-particles-with-opengl/ import sys import numpy as np from OpenGL import GL, GLU, GLUT from OpenGL.arrays import vbo from OpenGL.GL import ( GL_ARRAY_BUFFER, GL_BLEND, GL_COLOR_ARRAY, GL_COLOR_BUFFER_BIT, GL_DEPTH_BUFFER_BIT, GL_DYNAMIC_DRAW, GL_FLOAT, GL_MODELVIEW, GL_ONE_MINUS_SRC_ALPHA, GL_POINT_SMOOTH, GL_POINTS, GL_PROJECTION, GL_SRC_ALPHA, GL_VERTEX_ARRAY) from OpenGL.GLUT import GLUT_DEPTH, GLUT_DOUBLE, GLUT_RGBA import pyopencl as cl from pyopencl.tools import get_gl_sharing_context_properties mf = cl.mem_flags width = 800 height = 600 num_particles = 100000 time_step = 0.005 mouse_down = False mouse_old = {"x": 0.0, "y": 0.0} rotate = {"x": 0.0, "y": 0.0, "z": 0.0} translate = {"x": 0.0, "y": 0.0, "z": 0.0} initial_translate = {"x": 0.0, "y": 0.0, "z": -2.5} def glut_window(): GLUT.glutInit(sys.argv) GLUT.glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH) GLUT.glutInitWindowSize(width, height) GLUT.glutInitWindowPosition(0, 0) window = GLUT.glutCreateWindow("Particle Simulation") GLUT.glutDisplayFunc(on_display) # Called by GLUT every frame GLUT.glutKeyboardFunc(on_key) GLUT.glutMouseFunc(on_click) GLUT.glutMotionFunc(on_mouse_move) GLUT.glutTimerFunc(10, on_timer, 10) # Call draw every 30 ms GL.glViewport(0, 0, width, height) GL.glMatrixMode(GL_PROJECTION) GL.glLoadIdentity() GLU.gluPerspective(60.0, width / float(height), 0.1, 1000.0) return window def initial_buffers(num_particles): rng = np.random.default_rng() np_position = np.empty((num_particles, 4), dtype=np.float32) np_color = np.empty((num_particles, 4), dtype=np.float32) np_velocity = np.empty((num_particles, 4), dtype=np.float32) np_position[:, 0] = np.sin( np.arange(0.0, num_particles) * 2.001 * np.pi / num_particles ) np_position[:, 0] *= 
rng.integers(num_particles) / 3.0 + 0.2 np_position[:, 1] = np.cos( np.arange(0.0, num_particles) * 2.001 * np.pi / num_particles ) np_position[:, 1] *= rng.integers(num_particles) / 3.0 + 0.2 np_position[:, 2] = 0.0 np_position[:, 3] = 1.0 np_color[:, :] = [1.0, 1.0, 1.0, 1.0] # White particles np_velocity[:, 0] = np_position[:, 0] * 2.0 np_velocity[:, 1] = np_position[:, 1] * 2.0 np_velocity[:, 2] = 3.0 np_velocity[:, 3] = rng.integers(num_particles) gl_position = vbo.VBO( data=np_position, usage=GL_DYNAMIC_DRAW, target=GL_ARRAY_BUFFER ) gl_position.bind() gl_color = vbo.VBO(data=np_color, usage=GL_DYNAMIC_DRAW, target=GL_ARRAY_BUFFER) gl_color.bind() return (np_position, np_velocity, gl_position, gl_color) def on_timer(t): GLUT.glutTimerFunc(t, on_timer, t) GLUT.glutPostRedisplay() def on_key(*args): if args[0] == "\033" or args[0] == "q": sys.exit() def on_click(button, state, x, y): mouse_old["x"] = x mouse_old["y"] = y def on_mouse_move(x, y): rotate["x"] += (y - mouse_old["y"]) * 0.2 rotate["y"] += (x - mouse_old["x"]) * 0.2 mouse_old["x"] = x mouse_old["y"] = y def on_display(): """Render the particles""" # Update or particle positions by calling the OpenCL kernel cl.enqueue_acquire_gl_objects(queue, [cl_gl_position, cl_gl_color]) kernelargs = ( cl_gl_position, cl_gl_color, cl_velocity, cl_start_position, cl_start_velocity, np.float32(time_step), ) program.particle_fountain(queue, (num_particles,), None, *(kernelargs)) cl.enqueue_release_gl_objects(queue, [cl_gl_position, cl_gl_color]) queue.finish() GL.glFlush() GL.glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) GL.glMatrixMode(GL_MODELVIEW) GL.glLoadIdentity() # Handle mouse transformations GL.glTranslatef(initial_translate["x"], initial_translate["y"], initial_translate["z"]) GL.glRotatef(rotate["x"], 1, 0, 0) GL.glRotatef(rotate["y"], 0, 1, 0) # we switched around the axis so make this rotate_z GL.glTranslatef(translate["x"], translate["y"], translate["z"]) # Render the particles GL.glEnable(GL_POINT_SMOOTH) GL.glPointSize(2) GL.glEnable(GL_BLEND) GL.glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) # Set up the VBOs gl_color.bind() GL.glColorPointer(4, GL_FLOAT, 0, gl_color) gl_position.bind() GL.glVertexPointer(4, GL_FLOAT, 0, gl_position) GL.glEnableClientState(GL_VERTEX_ARRAY) GL.glEnableClientState(GL_COLOR_ARRAY) # Draw the VBOs GL.glDrawArrays(GL_POINTS, 0, num_particles) GL.glDisableClientState(GL_COLOR_ARRAY) GL.glDisableClientState(GL_VERTEX_ARRAY) GL.glDisable(GL_BLEND) GLUT.glutSwapBuffers() window = glut_window() (np_position, np_velocity, gl_position, gl_color) = initial_buffers(num_particles) platform = cl.get_platforms()[0] context = cl.Context( properties=[(cl.context_properties.PLATFORM, platform)] + get_gl_sharing_context_properties() ) queue = cl.CommandQueue(context) cl_velocity = cl.Buffer(context, mf.COPY_HOST_PTR, hostbuf=np_velocity) cl_start_position = cl.Buffer( context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np_position ) cl_start_velocity = cl.Buffer( context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np_velocity ) cl_gl_position = cl.GLBuffer(context, mf.READ_WRITE, int(gl_position)) cl_gl_color = cl.GLBuffer(context, mf.READ_WRITE, int(gl_color)) kernel = """__kernel void particle_fountain(__global float4* position, __global float4* color, __global float4* velocity, __global float4* start_position, __global float4* start_velocity, float time_step) { unsigned int i = get_global_id(0); float4 p = position[i]; float4 v = velocity[i]; float life = velocity[i].w; life -= time_step; if (life <= 0.f) 
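/* Life exhausted: respawn the particle from its recorded start position and velocity, with a fresh life of 1.0f. */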
{ p = start_position[i]; v = start_velocity[i]; life = 1.0f; } v.z -= 9.8f*time_step; p.x += v.x*time_step; p.y += v.y*time_step; p.z += v.z*time_step; v.w = life; position[i] = p; velocity[i] = v; color[i].w = life; /* Fade points as life decreases */ }""" program = cl.Program(context, kernel).build() GLUT.glutMainLoop() pyopencl-2025.1/examples/ipython-demo.ipynb0000644000000000000000000000764714332717401015677 0ustar00{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "cc7d0709", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "import pyopencl as cl\n", "import pyopencl.array" ] }, { "cell_type": "markdown", "id": "8ac8d7bb", "metadata": {}, "source": [ "Load the PyOpenCL IPython extension:" ] }, { "cell_type": "code", "execution_count": null, "id": "7023ca2f", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "%load_ext pyopencl.ipython_ext" ] }, { "cell_type": "markdown", "id": "9544b53c", "metadata": {}, "source": [ "Create an OpenCL context and a command queue:" ] }, { "cell_type": "code", "execution_count": null, "id": "fac17999", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "ctx = cl.create_some_context(interactive=True)\n", "queue = cl.CommandQueue(ctx)" ] }, { "cell_type": "markdown", "id": "a29daf04", "metadata": {}, "source": [ "-----\n", "\n", "Define an OpenCL kernel using the `%%cl_kernel` magic:" ] }, { "cell_type": "code", "execution_count": null, "id": "65c7e6c9", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "%%cl_kernel -o \"-cl-fast-relaxed-math\"\n", "\n", "__kernel void sum_vector(__global const float *a,\n", "__global const float *b, __global float *c)\n", "{\n", " int gid = get_global_id(0);\n", " c[gid] = a[gid] + b[gid];\n", "}" ] }, { "cell_type": "markdown", "id": "cfb57357", "metadata": {}, "source": [ "This looks for `cl_ctx` or `ctx` in the user namespace to find a PyOpenCL context.\n", "\n", "Kernel names are automatically injected into the user namespace, so we can just use `sum_vector` from Python below.\n", "\n", "Now create some data to work on:" ] }, { "cell_type": "code", "execution_count": null, "id": "1d80ff38", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "n = 10000\n", "\n", "a = cl.array.empty(queue, n, dtype=np.float32)\n", "a.fill(15)\n", "\n", "rng = np.random.default_rng()\n", "b_host = rng.normal(size=n).astype(np.float32)\n", "b = cl.array.to_device(queue, b_host)\n", "\n", "c = cl.array.empty_like(a)" ] }, { "cell_type": "markdown", "id": "61fccb61", "metadata": {}, "source": [ "Run the kernel:" ] }, { "cell_type": "code", "execution_count": null, "id": "2ba991b3", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "sum_vector(queue, (n,), None, a.data, b.data, c.data) # noqa: F821" ] }, { "cell_type": "markdown", "id": "11a55b38", "metadata": {}, "source": [ "Check the result using `numpy` operations:" ] }, { "cell_type": "code", "execution_count": null, "id": "ee3560c1", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "assert (c.get() == b_host + 15).all()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 } pyopencl-2025.1/examples/median-filter.py0000644000000000000000000000513014332717401015273 0ustar00import numpy as np from imageio import imread, imsave import pyopencl as cl # Read in image img = imread("noisyImage.jpg").astype(np.float32) print(img.shape) img = np.mean(img, axis=2) print(img.shape) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) mf = cl.mem_flags # Kernel function src = """ void sort(int *a, int *b, int *c) { int swap; if(*a > *b) { swap = *a; *a = *b; *b = swap; } if(*a > *c) { swap = *a; *a = *c; *c = swap; } if(*b > *c) { swap = *b; *b = *c; *c = swap; } } __kernel void medianFilter( __global float *img, __global float *result, __global int *width, __global int *height) { int w = *width; int h = *height; int posx = get_global_id(1); int posy = get_global_id(0); int i = w*posy + posx; // Keeping the edge pixels the same if( posx == 0 || posy == 0 || posx == w-1 || posy == h-1 ) { result[i] = img[i]; } else { int pixel00, pixel01, pixel02, pixel10, pixel11, pixel12, pixel20, pixel21, pixel22; pixel00 = img[i - 1 - w]; pixel01 = img[i- w]; pixel02 = img[i + 1 - w]; pixel10 = img[i - 1]; pixel11 = img[i]; pixel12 = img[i + 1]; pixel20 = img[i - 1 + w]; pixel21 = img[i + w]; pixel22 = img[i + 1 + w]; //sort the rows sort( &(pixel00), &(pixel01), &(pixel02) ); sort( &(pixel10), &(pixel11), &(pixel12) ); sort( &(pixel20), &(pixel21), &(pixel22) ); //sort the columns sort( &(pixel00), &(pixel10), &(pixel20) ); sort( &(pixel01), &(pixel11), &(pixel21) ); sort( &(pixel02), &(pixel12), &(pixel22) ); //sort the diagonal sort( &(pixel00), &(pixel11), &(pixel22) ); // median is the the middle value of the diagonal result[i] = pixel11; } } """ # Kernel function instantiation prg = cl.Program(ctx, src).build() # Allocate memory for variables on the device img_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img) result_g = cl.Buffer(ctx, mf.WRITE_ONLY, img.nbytes) width_g = cl.Buffer( ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np.int32(img.shape[1]) ) height_g = cl.Buffer( ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np.int32(img.shape[0]) ) # Call Kernel. Automatically takes care of block/grid distribution prg.medianFilter(queue, img.shape, None, img_g, result_g, width_g, height_g) result = np.empty_like(img) cl.enqueue_copy(queue, result, result_g) # Show the blurred image imsave("medianFilter-OpenCL.jpg", result, mode="RGB") pyopencl-2025.1/examples/n-body.py0000644000000000000000000007640414332717401013757 0ustar00#!/usr/bin/env python3 """ NBody Demonstrator implemented in OpenCL, rendering OpenGL By default, rendering in OpenGL is disabled. Add -g option to activate. 
Part of matrix programs from: https://forge.cbp.ens-lyon.fr/svn/bench4gpu/ CC BY-NC-SA 2011 : Emmanuel QUEMENER Cecill v2 : Emmanuel QUEMENER Thanks to Andreas Klockner for PyOpenCL: http://mathema.tician.de/software/pyopencl """ import getopt import sys import time import numpy as np import pyopencl as cl import pyopencl.array def DictionariesAPI(): Marsaglia = {"CONG": 0, "SHR3": 1, "MWC": 2, "KISS": 3} Computing = {"FP32": 0, "FP64": 1} Interaction = {"Force": 0, "Potential": 1} Artevasion = {"None": 0, "NegExp": 1, "CorRad": 2} return (Marsaglia, Computing, Interaction, Artevasion) BlobOpenCL = """ #define TFP32 0 #define TFP64 1 #define TFORCE 0 #define TPOTENTIAL 1 #define NONE 0 #define NEGEXP 1 #define CORRAD 2 #if TYPE == TFP32 #define MYFLOAT4 float4 #define MYFLOAT8 float8 #define MYFLOAT float #define DISTANCE fast_distance #else #define MYFLOAT4 double4 #define MYFLOAT8 double8 #define MYFLOAT double #define DISTANCE distance #if defined(cl_khr_fp64) // Khronos extension available? #pragma OPENCL EXTENSION cl_khr_fp64 : enable #endif #endif #define znew ((zmwc=36969*(zmwc&65535)+(zmwc>>16))<<16) #define wnew ((wmwc=18000*(wmwc&65535)+(wmwc>>16))&65535) #define MWC (znew+wnew) #define SHR3 (jsr=(jsr=(jsr=jsr^(jsr<<17))^(jsr>>13))^(jsr<<5)) #define CONG (jcong=69069*jcong+1234567) #define KISS ((MWC^CONG)+SHR3) #define MWCfp (MYFLOAT)(MWC * 2.3283064365386963e-10f) #define KISSfp (MYFLOAT)(KISS * 2.3283064365386963e-10f) #define SHR3fp (MYFLOAT)(SHR3 * 2.3283064365386963e-10f) #define CONGfp (MYFLOAT)(CONG * 2.3283064365386963e-10f) #define PI (MYFLOAT)3.141592653589793238e0f #define SMALL_NUM (MYFLOAT)1.e-9f #define CoreRadius (MYFLOAT)(1.e0f) // Create my own Distance implementation: distance buggy on Oland AMD chipset MYFLOAT MyDistance(MYFLOAT4 n,MYFLOAT4 m) { private MYFLOAT x2,y2,z2; x2=n.s0-m.s0; x2*=x2; y2=n.s1-m.s1; y2*=y2; z2=n.s2-m.s2; z2*=z2; return(sqrt(x2+y2+z2)); } // Potential between 2 m,n bodies MYFLOAT PairPotential(MYFLOAT4 m,MYFLOAT4 n) #if ARTEVASION == NEGEXP // Add exp(-r) to numerator to avoid divergence for low distances { MYFLOAT r=DISTANCE(n,m); return((-1.e0f+exp(-r))/r); } #elif ARTEVASION == CORRAD // Add Core Radius to avoid divergence for low distances { MYFLOAT r=DISTANCE(n,m); return(-1.e0f/sqrt(r*r+CoreRadius*CoreRadius)); } #else // Classical potential in 1/r { // return((MYFLOAT)(-1.e0f)/(MyDistance(m,n))); return((MYFLOAT)(-1.e0f)/(DISTANCE(n,m))); } #endif // Interaction based of Force as gradient of Potential MYFLOAT4 Interaction(MYFLOAT4 m,MYFLOAT4 n) #if INTERACTION == TFORCE #if ARTEVASION == NEGEXP // Force gradient of potential, set as (1-exp(-r))/r { private MYFLOAT r=MyDistance(n,m); private MYFLOAT num=1.e0f+exp(-r)*(r-1.e0f); return((n-m)*num/(MYFLOAT)(r*r*r)); } #elif ARTEVASION == CORRAD // Force gradient of potential, (Core Radius) set as 1/sqrt(r**2+CoreRadius**2) { private MYFLOAT r=MyDistance(n,m); private MYFLOAT den=sqrt(r*r+CoreRadius*CoreRadius); return((n-m)/(MYFLOAT)(den*den*den)); } #else // Simplest implementation of force (equals to acceleration) // seems to bo bad (numerous artevasions) // MYFLOAT4 InteractionForce(MYFLOAT4 m,MYFLOAT4 n) { private MYFLOAT r=MyDistance(n,m); return((n-m)/(MYFLOAT)(r*r*r)); } #endif #else // Force definited as gradient of potential // Estimate potential and proximate potential to estimate force { // 1/1024 seems to be a good factor: larger one provides bad results private MYFLOAT epsilon=(MYFLOAT)(1.e0f/1024); private MYFLOAT4 er=normalize(n-m); private MYFLOAT4 
dr=er*(MYFLOAT)epsilon; return(er/epsilon*(PairPotential(m,n)-PairPotential(m+dr,n))); } #endif MYFLOAT AtomicPotential(__global MYFLOAT4* clDataX,int gid) { private MYFLOAT potential=(MYFLOAT)0.e0f; private MYFLOAT4 x=clDataX[gid]; for (int i=0;iRadius) { Position=(MYFLOAT4)((MWCfp-0.5e0f)*diameter,(MWCfp-0.5e0f)*diameter,(MWCfp-0.5e0f)*diameter,0.e0f); Length=(MYFLOAT)length((MYFLOAT4)Position); } clDataX[gid]=Position; barrier(CLK_GLOBAL_MEM_FENCE); } __kernel void InBoxSplutterPoints(__global MYFLOAT4* clDataX, MYFLOAT box, uint seed_z,uint seed_w) { int gid=get_global_id(0); uint zmwc=seed_z+gid; uint wmwc=seed_w-gid; private MYFLOAT Heat; for (int i=0;i Rotate around X axis") print("\t Rotate around Y axis") print("\t Rotate around Z axis") print("\t <-|+> Unzoom/Zoom") print("\t Toggle to display Positions or Velocities") print("\t Quit\n") wall_time_start = time.time() Durations = np.array([], dtype=MyFloat) print("Starting!") if OpenGL: import OpenGL.GL as gl import OpenGL.GLUT as glut global ViewRX, ViewRY, ViewRZ Iterations = 0 ViewRX, ViewRY, ViewRZ = 0.0, 0.0, 0.0 # Launch OpenGL Loop glut.glutInit(sys.argv) glut.glutInitDisplayMode(glut.GLUT_DOUBLE | glut.GLUT_RGB) glut.glutSetOption(glut.GLUT_ACTION_ON_WINDOW_CLOSE, glut.GLUT_ACTION_CONTINUE_EXECUTION) glut.glutInitWindowSize(512, 512) glut.glutCreateWindow(b"NBodyGL") setup_viewport() glut.glutReshapeFunc(reshape) glut.glutDisplayFunc(display) glut.glutIdleFunc(display) # glutMouseFunc(mouse) glut.glutSpecialFunc(special) glut.glutKeyboardFunc(keyboard) glut.glutMainLoop() else: for iteration in range(Iterations): Elapsed = MainOpenCL(clDataX, clDataV, Step, Method) if Verbose: # print("Duration of #%s iteration: %s" % (iteration,Elapsed)) cl.enqueue_copy(queue, MyDataX, clDataX) print("Positions for #%s iteration: %s" % (iteration, MyDataX)) else: sys.stdout.write(".") sys.stdout.flush() Durations = np.append(Durations, Elapsed) print("\nEnding!") MyRoutines.CenterOfMass(queue, (1, 1), None, clDataX, clCoM, np.int32(Number)) CLLaunch = MyRoutines.Potential(queue, (Number, 1), None, clDataX, clPotential) CLLaunch = MyRoutines.Kinetic(queue, (Number, 1), None, clDataV, clKinetic) CLLaunch.wait() cl.enqueue_copy(queue, MyCoM, clCoM) cl.enqueue_copy(queue, MyPotential, clPotential) cl.enqueue_copy(queue, MyKinetic, clKinetic) print("\nCenter Of Mass estimated: (%s,%s,%s)" % (MyCoM[0], MyCoM[1], MyCoM[2])) print( "Energy estimated: Viriel=%s Potential=%s Kinetic=%s\n" % ( np.sum(MyPotential) + 2.0 * np.sum(MyKinetic), np.sum(MyPotential), np.sum(MyKinetic), ) ) print( "Duration stats on device %s with %s iterations :\n\tMean:\t%s\n\tMedian:\t%s\n\tStddev:\t%s\n\tMin:\t%s\n\tMax:\t%s\n\n\tVariability:\t%s\n" % ( Device, Iterations, np.mean(Durations), np.median(Durations), np.std(Durations), np.min(Durations), np.max(Durations), np.std(Durations) / np.median(Durations), ) ) # FPS: 1/Elapsed FPS = np.ones(len(Durations)) FPS /= Durations print( "FPS stats on device %s with %s iterations :\n\tMean:\t%s\n\tMedian:\t%s\n\tStddev:\t%s\n\tMin:\t%s\n\tMax:\t%s\n" % ( Device, Iterations, np.mean(FPS), np.median(FPS), np.std(FPS), np.min(FPS), np.max(FPS), ) ) # Contraction of Square*Size*Hertz: Size*Size/Elapsed Squertz = np.ones(len(Durations)) Squertz *= Number * Number Squertz /= Durations print( "Squertz in log10 & complete stats on device %s with %s iterations :\n\tMean:\t%s\t%s\n\tMedian:\t%s\t%s\n\tStddev:\t%s\t%s\n\tMin:\t%s\t%s\n\tMax:\t%s\t%s\n" % ( Device, Iterations, np.log10(np.mean(Squertz)), np.mean(Squertz), 
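# (Squertz = Number*Number/Elapsed, i.e. pairwise body-body interactions per second)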
np.log10(np.median(Squertz)), np.median(Squertz), np.log10(np.std(Squertz)), np.std(Squertz), np.log10(np.min(Squertz)), np.min(Squertz), np.log10(np.max(Squertz)), np.max(Squertz), ) ) clDataX.release() clDataV.release() clKinetic.release() clPotential.release() pyopencl-2025.1/examples/narray.py0000644000000000000000000000130314332717401014050 0ustar00# example by Roger Pau Monn'e import numpy as np import pyopencl as cl demo_r = np.empty((500, 5), dtype=np.uint32) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) mf = cl.mem_flags demo_buf = cl.Buffer(ctx, mf.WRITE_ONLY, demo_r.nbytes) prg = cl.Program(ctx, """ __kernel void demo(__global uint *demo) { int i; int gid = get_global_id(0); for(i=0; i<5;i++) { demo[gid*5+i] = (uint) 1; } }""") try: prg.build() except Exception: print("Error:") print(prg.get_build_info(ctx.devices[0], cl.program_build_info.LOG)) raise prg.demo(queue, (500,), None, demo_buf) cl.enqueue_copy(queue, demo_r, demo_buf).wait() for res in demo_r: print(res) pyopencl-2025.1/examples/noisyImage.jpg0000644000000000000000000020236614332717401015011 0ustar00[binary JPEG image data omitted]
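# A minimal sketch (not part of the original archive): the same fill-with-ones
# demo as narray.py above, using the higher-level pyopencl.array interface
# instead of a raw Buffer and a hand-written kernel. Assumes only numpy and
# pyopencl are installed.
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

demo = cl_array.zeros(queue, (500, 5), dtype=np.uint32)  # device-side array
demo.fill(1)       # runs a generated elementwise fill kernel on the device
print(demo.get())  # copies the result back into a numpy ndarray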
pyopencl-2025.1/examples/pi-monte-carlo.py
# Cecill v2 : Emmanuel QUEMENER # # Thanks to Andreas Klockner for PyCUDA: # http://mathema.tician.de/software/pycuda # Thanks to Andreas Klockner for PyOpenCL: # http://mathema.tician.de/software/pyopencl # # 2013-01-01 : problems with launch timeout # http://stackoverflow.com/questions/497685/how-do-you-get-around-the-maximum-cuda-run-time # Option "Interactive" "0" in /etc/X11/xorg.conf import getopt import itertools import sys import time from socket import gethostname # Common tools import numpy class PenStacle: """Pentacle of Statistics from data""" Avg = 0 Med = 0 Std = 0 Min = 0 Max = 0 def __init__(self, Data): self.Avg = numpy.average(Data) self.Med = numpy.median(Data) self.Std = numpy.std(Data) self.Max = numpy.max(Data) self.Min = numpy.min(Data) def display(self): print("%s %s %s %s %s" % (self.Avg, self.Med, self.Std, self.Min, self.Max)) class Experience: """Metrology for experiences""" DeviceStyle = "" DeviceId = 0 AvgD = 0 MedD = 0 StdD = 0 MinD = 0 MaxD = 0 AvgR = 0 MedR = 0 StdR = 0 MinR = 0 MaxR = 0 def __init__(self, DeviceStyle, DeviceId, Iterations): self.DeviceStyle = DeviceStyle self.DeviceId = DeviceId self.Iterations = Iterations def Metrology(self, Data): Duration = PenStacle(Data) Rate = PenStacle(self.Iterations / Data) print("Duration %s" % Duration) print("Rate %s" % Rate) def DictionariesAPI(): Marsaglia = {"CONG": 0, "SHR3": 1, "MWC": 2, "KISS": 3} Computing = {"INT32": 0, "INT64": 1, "FP32": 2, "FP64": 3} Test = {True: 1, False: 0} return (Marsaglia, Computing, Test) # find prime factors of a number # Got from WWW: # http://pythonism.wordpress.com/2008/05/17/looking-at-factorisation-in-python/ def PrimeFactors(x): factorlist = numpy.array([]).astype("uint32") loop = 2 while loop <= x: if x % loop == 0: x /= loop factorlist = numpy.append(factorlist, [loop]) else: loop += 1 return factorlist # Try to find the best thread number in Hybrid approach (Blocks&Threads) # output is thread number def BestThreadsNumber(jobs): factors = PrimeFactors(jobs) matrix = numpy.append([factors], [factors[::-1]], axis=0) threads = 1 for factor in matrix.transpose().ravel(): threads = threads * factor if threads * threads > jobs or threads > 512: break return int(threads) # Predicted Amdahl Law (Reduced with s=1-p) def AmdahlR(N, T1, p): return T1 * (1 - p + p / N) # Predicted Amdahl Law def Amdahl(N, T1, s, p): return T1 * (s + p / N) # Predicted Mylq Law with first order def Mylq(N, T1, s, c, p): return T1 * (s + p / N) + c * N # Predicted Mylq Law with second order def Mylq2(N, T1, s, c1, c2, p): return T1 * (s + p / N) + c1 * N + c2 * N * N def KernelCodeCuda(): KERNEL_CODE_CUDA = """ #define TCONG 0 #define TSHR3 1 #define TMWC 2 #define TKISS 3 #define TINT32 0 #define TINT64 1 #define TFP32 2 #define TFP64 3 #define IFTHEN 1 // Marsaglia RNG very simple implementation #define znew ((z=36969*(z&65535)+(z>>16))<<16) #define wnew ((w=18000*(w&65535)+(w>>16))&65535) #define MWC (znew+wnew) #define SHR3 (jsr=(jsr=(jsr=jsr^(jsr<<17))^(jsr>>13))^(jsr<<5)) #define CONG (jcong=69069*jcong+1234567) #define KISS ((MWC^CONG)+SHR3) #define MWCfp MWC * 2.328306435454494e-10f #define KISSfp KISS * 2.328306435454494e-10f #define SHR3fp
SHR3 * 2.328306435454494e-10f #define CONGfp CONG * 2.328306435454494e-10f __device__ ulong MainLoop(ulong iterations,uint seed_w,uint seed_z,size_t work) { #if TRNG == TCONG uint jcong=seed_z+work; #elif TRNG == TSHR3 uint jsr=seed_w+work; #elif TRNG == TMWC uint z=seed_z+work; uint w=seed_w+work; #elif TRNG == TKISS uint jcong=seed_z+work; uint jsr=seed_w+work; uint z=seed_z-work; uint w=seed_w-work; #endif ulong total=0; for (ulong i=0;i<iterations;i++) { #if TYPE == TINT32 #define THEONE 1073741824 #if TRNG == TCONG uint x=CONG>>17 ; uint y=CONG>>17 ; #elif TRNG == TSHR3 uint x=SHR3>>17 ; uint y=SHR3>>17 ; #elif TRNG == TMWC uint x=MWC>>17 ; uint y=MWC>>17 ; #elif TRNG == TKISS uint x=KISS>>17 ; uint y=KISS>>17 ; #endif #elif TYPE == TINT64 #define THEONE 4611686018427387904 #if TRNG == TCONG ulong x=(ulong)(CONG>>1) ; ulong y=(ulong)(CONG>>1) ; #elif TRNG == TSHR3 ulong x=(ulong)(SHR3>>1) ; ulong y=(ulong)(SHR3>>1) ; #elif TRNG == TMWC ulong x=(ulong)(MWC>>1) ; ulong y=(ulong)(MWC>>1) ; #elif TRNG == TKISS ulong x=(ulong)(KISS>>1) ; ulong y=(ulong)(KISS>>1) ; #endif #elif TYPE == TFP32 #define THEONE 1.0f #if TRNG == TCONG float x=CONGfp ; float y=CONGfp ; #elif TRNG == TSHR3 float x=SHR3fp ; float y=SHR3fp ; #elif TRNG == TMWC float x=MWCfp ; float y=MWCfp ; #elif TRNG == TKISS float x=KISSfp ; float y=KISSfp ; #endif #elif TYPE == TFP64 #define THEONE 1.0f #if TRNG == TCONG double x=(double)CONGfp ; double y=(double)CONGfp ; #elif TRNG == TSHR3 double x=(double)SHR3fp ; double y=(double)SHR3fp ; #elif TRNG == TMWC double x=(double)MWCfp ; double y=(double)MWCfp ; #elif TRNG == TKISS double x=(double)KISSfp ; double y=(double)KISSfp ; #endif #endif #if TEST == IFTHEN if ((x*x+y*y) <=THEONE) { total+=1; } #else ulong inside=((x*x+y*y) <= THEONE) ? 1:0; total+=inside; #endif } return(total); } __global__ void MainLoopBlocks(ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,blockIdx.x); s[blockIdx.x]=total; __syncthreads(); } __global__ void MainLoopThreads(ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,threadIdx.x); s[threadIdx.x]=total; __syncthreads(); } __global__ void MainLoopHybrid(ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,blockDim.x*blockIdx.x+threadIdx.x); s[blockDim.x*blockIdx.x+threadIdx.x]=total; __syncthreads(); } """ return KERNEL_CODE_CUDA def KernelCodeOpenCL(): KERNEL_CODE_OPENCL = """ #define TCONG 0 #define TSHR3 1 #define TMWC 2 #define TKISS 3 #define TINT32 0 #define TINT64 1 #define TFP32 2 #define TFP64 3 #define IFTHEN 1 // Marsaglia RNG very simple implementation #define znew ((z=36969*(z&65535)+(z>>16))<<16) #define wnew ((w=18000*(w&65535)+(w>>16))&65535) #define MWC (znew+wnew) #define SHR3 (jsr=(jsr=(jsr=jsr^(jsr<<17))^(jsr>>13))^(jsr<<5)) #define CONG (jcong=69069*jcong+1234567) #define KISS ((MWC^CONG)+SHR3) #define MWCfp MWC * 2.328306435454494e-10f #define KISSfp KISS * 2.328306435454494e-10f #define CONGfp CONG * 2.328306435454494e-10f #define SHR3fp SHR3 * 2.328306435454494e-10f ulong MainLoop(ulong iterations,uint seed_z,uint seed_w,size_t work) { #if TRNG == TCONG uint jcong=seed_z+work; #elif TRNG == TSHR3 uint jsr=seed_w+work; #elif TRNG == TMWC uint z=seed_z+work; uint w=seed_w+work; #elif TRNG == TKISS uint jcong=seed_z+work; uint jsr=seed_w+work; uint z=seed_z-work; uint w=seed_w-work; #endif ulong total=0; for (ulong i=0;i<iterations;i++) { #if TYPE == TINT32 #define THEONE 1073741824 #if TRNG == TCONG uint x=CONG>>17 ; uint y=CONG>>17 ; #elif TRNG == TSHR3 uint x=SHR3>>17 ; uint y=SHR3>>17 ; #elif TRNG == TMWC uint x=MWC>>17 ; uint y=MWC>>17 ; #elif TRNG == TKISS uint
x=KISS>>17 ; uint y=KISS>>17 ; #endif #elif TYPE == TINT64 #define THEONE 4611686018427387904 #if TRNG == TCONG ulong x=(ulong)(CONG>>1) ; ulong y=(ulong)(CONG>>1) ; #elif TRNG == TSHR3 ulong x=(ulong)(SHR3>>1) ; ulong y=(ulong)(SHR3>>1) ; #elif TRNG == TMWC ulong x=(ulong)(MWC>>1) ; ulong y=(ulong)(MWC>>1) ; #elif TRNG == TKISS ulong x=(ulong)(KISS>>1) ; ulong y=(ulong)(KISS>>1) ; #endif #elif TYPE == TFP32 #define THEONE 1.0f #if TRNG == TCONG float x=CONGfp ; float y=CONGfp ; #elif TRNG == TSHR3 float x=SHR3fp ; float y=SHR3fp ; #elif TRNG == TMWC float x=MWCfp ; float y=MWCfp ; #elif TRNG == TKISS float x=KISSfp ; float y=KISSfp ; #endif #elif TYPE == TFP64 #pragma OPENCL EXTENSION cl_khr_fp64: enable #define THEONE 1.0f #if TRNG == TCONG double x=(double)CONGfp ; double y=(double)CONGfp ; #elif TRNG == TSHR3 double x=(double)SHR3fp ; double y=(double)SHR3fp ; #elif TRNG == TMWC double x=(double)MWCfp ; double y=(double)MWCfp ; #elif TRNG == TKISS double x=(double)KISSfp ; double y=(double)KISSfp ; #endif #endif #if TEST == IFTHEN if ((x*x+y*y) <= THEONE) { total+=1; } #else ulong inside=((x*x+y*y) <= THEONE) ? 1:0; total+=inside; #endif } return(total); } __kernel void MainLoopGlobal( __global ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,get_global_id(0)); barrier(CLK_GLOBAL_MEM_FENCE); s[get_global_id(0)]=total; } __kernel void MainLoopLocal( __global ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,get_local_id(0)); barrier(CLK_LOCAL_MEM_FENCE); s[get_local_id(0)]=total; } __kernel void MainLoopHybrid( __global ulong *s,ulong iterations,uint seed_w,uint seed_z) { ulong total=MainLoop(iterations,seed_z,seed_w,get_global_id(0)); barrier(CLK_GLOBAL_MEM_FENCE || CLK_LOCAL_MEM_FENCE); s[get_global_id(0)]=total; } """ return KERNEL_CODE_OPENCL def MetropolisCuda(InputCU): print("Inside ", InputCU) iterations = InputCU["Iterations"] steps = InputCU["Steps"] blocks = InputCU["Blocks"] threads = InputCU["Threads"] Device = InputCU["Device"] RNG = InputCU["RNG"] ValueType = InputCU["ValueType"] Seeds = InputCU["Seeds"] Marsaglia, Computing, Test = DictionariesAPI() try: # For PyCUDA import import pycuda.driver as cuda from pycuda.compiler import SourceModule cuda.init() for Id in range(cuda.Device.count()): if Id == Device: XPU = cuda.Device(Id) print("GPU selected %s" % XPU.name()) print except ImportError: print("Platform does not seem to support CUDA") circle = numpy.zeros(blocks * threads).astype(numpy.uint64) circleCU = cuda.InOut(circle) # circleCU = cuda.mem_alloc(circle.size*circle.dtype.itemize) # cuda.memcpy_htod(circleCU, circle) Context = XPU.make_context() try: mod = SourceModule( KernelCodeCuda(), options=[ "--compiler-options", "-DTRNG=%i -DTYPE=%s" % (Marsaglia[RNG], Computing[ValueType]), ], ) # mod = SourceModule(KernelCodeCuda(),nvcc='nvcc',keep=True) # Needed to set the compiler via ccbin for CUDA9 implementation # mod = SourceModule(KernelCodeCuda(),options=['-ccbin','clang-3.9','--compiler-options','-DTRNG=%i' % Marsaglia[RNG],'-DTYPE=%s' % Computing[ValueType],'-DTEST=%s' % Test[TestType]],keep=True) # noqa: E501 except Exception: print("Compilation seems to break") MetropolisBlocksCU = mod.get_function("MainLoopBlocks") MetropolisThreadsCU = mod.get_function("MainLoopThreads") MetropolisHybridCU = mod.get_function("MainLoopHybrid") MyDuration = numpy.zeros(steps) jobs = blocks * threads iterationsCU = numpy.uint64(iterations / jobs) if iterations % jobs != 0: 
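# Ceiling division: when iterations is not a multiple of jobs, each work item runs one extra pass, so jobs*iterationsCU slightly overshoots the request; the actual total is reported back as NewIterations.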
iterationsCU += numpy.uint64(1) for i in range(steps): start_time = time.time() try: MetropolisHybridCU( circleCU, numpy.uint64(iterationsCU), numpy.uint32(Seeds[0]), numpy.uint32(Seeds[1]), grid=(blocks, 1), block=(threads, 1, 1), ) except Exception: print("Crash during CUDA call") elapsed = time.time() - start_time print( "(Blocks/Threads)=(%i,%i) method done in %.2f s..." % (blocks, threads, elapsed) ) MyDuration[i] = elapsed OutputCU = { "Inside": sum(circle), "NewIterations": numpy.uint64(iterationsCU * jobs), "Duration": MyDuration, } print(OutputCU) Context.pop() Context.detach() return OutputCU def MetropolisOpenCL(InputCL): import pyopencl as cl iterations = InputCL["Iterations"] steps = InputCL["Steps"] blocks = InputCL["Blocks"] threads = InputCL["Threads"] Device = InputCL["Device"] RNG = InputCL["RNG"] ValueType = InputCL["ValueType"] TestType = InputCL["IfThen"] Seeds = InputCL["Seeds"] Marsaglia, Computing, Test = DictionariesAPI() # Initialize the variables, casting them to the right types Id = 0 HasXPU = False for platform in cl.get_platforms(): for device in platform.get_devices(): if Id == Device: XPU = device print("CPU/GPU selected: ", device.name.lstrip()) HasXPU = True Id += 1 # print(Id) if not HasXPU: print("No XPU #%i found in all of %i devices, sorry..." % (Device, Id - 1)) sys.exit() # Create the context and the command queue for execution try: ctx = cl.Context(devices=[XPU]) queue = cl.CommandQueue( ctx, properties=cl.command_queue_properties.PROFILING_ENABLE ) except Exception: print("Crash during context creation") # Get the available flags for buffer allocation mf = cl.mem_flags circle = numpy.zeros(blocks * threads).astype(numpy.uint64) circleCL = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=circle) MetropolisCL = cl.Program(ctx, KernelCodeOpenCL()).build( options="-cl-mad-enable -cl-fast-relaxed-math -DTRNG=%i -DTYPE=%s -DTEST=%s" % (Marsaglia[RNG], Computing[ValueType], Test[TestType]) ) MyDuration = numpy.zeros(steps) jobs = blocks * threads iterationsCL = numpy.uint64(iterations / jobs) if iterations % jobs != 0: iterationsCL += 1 for i in range(steps): start_time = time.time() if threads == 1: CLLaunch = MetropolisCL.MainLoopGlobal( queue, (blocks,), None, circleCL, numpy.uint64(iterationsCL), numpy.uint32(Seeds[0]), numpy.uint32(Seeds[1]), ) else: CLLaunch = MetropolisCL.MainLoopHybrid( queue, (jobs,), (threads,), circleCL, numpy.uint64(iterationsCL), numpy.uint32(Seeds[0]), numpy.uint32(Seeds[1]), ) CLLaunch.wait() cl.enqueue_copy(queue, circle, circleCL).wait() elapsed = time.time() - start_time print( "(Blocks/Threads)=(%i,%i) method done in %.2f s..."
% (blocks, threads, elapsed) ) # Elapsed method based on CLLaunch doesn't work for Beignet OpenCL # elapsed = 1e-9*(CLLaunch.profile.end - CLLaunch.profile.start) # print circle,numpy.mean(circle),numpy.median(circle),numpy.std(circle) MyDuration[i] = elapsed # AllPi=4./numpy.float32(iterationsCL)*circle.astype(numpy.float32) # MyPi[i]=numpy.median(AllPi) # print MyPi[i],numpy.std(AllPi),MyDuration[i] circleCL.release() OutputCL = { "Inside": sum(circle), "NewIterations": numpy.uint64(iterationsCL * jobs), "Duration": MyDuration, } # print(OutputCL) return OutputCL def FitAndPrint(N, D, Curves): import matplotlib.pyplot as plt from scipy.optimize import curve_fit try: coeffs_Amdahl, matcov_Amdahl = curve_fit(Amdahl, N, D) D_Amdahl = Amdahl(N, coeffs_Amdahl[0], coeffs_Amdahl[1], coeffs_Amdahl[2]) coeffs_Amdahl[1] = coeffs_Amdahl[1] * coeffs_Amdahl[0] / D[0] coeffs_Amdahl[2] = coeffs_Amdahl[2] * coeffs_Amdahl[0] / D[0] coeffs_Amdahl[0] = D[0] print( "Amdahl Normalized: T=%.2f(%.6f+%.6f/N)" % (coeffs_Amdahl[0], coeffs_Amdahl[1], coeffs_Amdahl[2]) ) except Exception: print("Impossible to fit for Amdahl law : only %i elements" % len(D)) try: coeffs_AmdahlR, matcov_AmdahlR = curve_fit(AmdahlR, N, D) # D_AmdahlR = AmdahlR(N, coeffs_AmdahlR[0], coeffs_AmdahlR[1]) coeffs_AmdahlR[1] = coeffs_AmdahlR[1] * coeffs_AmdahlR[0] / D[0] coeffs_AmdahlR[0] = D[0] print( "Amdahl Reduced Normalized: T=%.2f(%.6f+%.6f/N)" % (coeffs_AmdahlR[0], 1 - coeffs_AmdahlR[1], coeffs_AmdahlR[1]) ) except Exception: print("Impossible to fit for Reduced Amdahl law : only %i elements" % len(D)) try: coeffs_Mylq, matcov_Mylq = curve_fit(Mylq, N, D) coeffs_Mylq[1] = coeffs_Mylq[1] * coeffs_Mylq[0] / D[0] # coeffs_Mylq[2]=coeffs_Mylq[2]*coeffs_Mylq[0]/D[0] coeffs_Mylq[3] = coeffs_Mylq[3] * coeffs_Mylq[0] / D[0] coeffs_Mylq[0] = D[0] print( "Mylq Normalized : T=%.2f(%.6f+%.6f/N)+%.6f*N" % (coeffs_Mylq[0], coeffs_Mylq[1], coeffs_Mylq[3], coeffs_Mylq[2]) ) D_Mylq = Mylq(N, coeffs_Mylq[0], coeffs_Mylq[1], coeffs_Mylq[2], coeffs_Mylq[3]) except Exception: print("Impossible to fit for Mylq law : only %i elements" % len(D)) try: coeffs_Mylq2, matcov_Mylq2 = curve_fit(Mylq2, N, D) coeffs_Mylq2[1] = coeffs_Mylq2[1] * coeffs_Mylq2[0] / D[0] # coeffs_Mylq2[2]=coeffs_Mylq2[2]*coeffs_Mylq2[0]/D[0] # coeffs_Mylq2[3]=coeffs_Mylq2[3]*coeffs_Mylq2[0]/D[0] coeffs_Mylq2[4] = coeffs_Mylq2[4] * coeffs_Mylq2[0] / D[0] coeffs_Mylq2[0] = D[0] print( "Mylq 2nd order Normalized: T=%.2f(%.6f+%.6f/N)+%.6f*N+%.6f*N^2" % ( coeffs_Mylq2[0], coeffs_Mylq2[1], coeffs_Mylq2[4], coeffs_Mylq2[2], coeffs_Mylq2[3], ) ) except Exception: print("Impossible to fit for 2nd order Mylq law : only %i elements" % len(D)) if Curves: plt.xlabel("Number of Threads/work Items") plt.ylabel("Total Elapsed Time") (Experience,) = plt.plot(N, D, "ro") try: (pAmdahl,) = plt.plot(N, D_Amdahl, label="Loi de Amdahl") (pMylq,) = plt.plot(N, D_Mylq, label="Loi de Mylq") except Exception: print("Fit curves seem not to be available") plt.legend() plt.show() if __name__ == "__main__": # Set defaults values # GPU style can be Cuda (Nvidia implementation) or OpenCL GpuStyle = "OpenCL" # Iterations is integer Iterations = 1000000000 # BlocksBlocks in first number of Blocks to explore BlocksBegin = 1024 # BlocksEnd is last number of Blocks to explore BlocksEnd = 1024 # BlocksStep is the step of Blocks to explore BlocksStep = 1 # ThreadsBlocks in first number of Blocks to explore ThreadsBegin = 1 # ThreadsEnd is last number of Blocks to explore ThreadsEnd = 1 # ThreadsStep is the step of Blocks to 
explore ThreadsStep = 1 # Redo is the times to redo the test to improve metrology Redo = 1 # OutMetrology is method for duration estimation : False is GPU inside OutMetrology = False Metrology = "InMetro" # Curves is True to print the curves Curves = False # Fit is True to print the curves Fit = False # Inside based on If IfThen = False # Marsaglia RNG RNG = "MWC" # Value type : INT32, INT64, FP32, FP64 ValueType = "FP32" # Seeds for RNG Seeds = 110271, 101008 HowToUse = "%s -o (Out of Core Metrology) -c (Print Curves) -k (Case On IfThen) -d -g -i -b -e -s -f -l -t -r -m -v " # noqa: E501 try: opts, args = getopt.getopt( sys.argv[1:], "hockg:i:b:e:s:f:l:t:r:d:m:v:", [ "gpustyle=", "iterations=", "blocksBegin=", "blocksEnd=", "blocksStep=", "threadsFirst=", "threadsLast=", "threadssTep=", "redo=", "device=", "marsaglia=", "valuetype=", ], ) except getopt.GetoptError: print(HowToUse % sys.argv[0]) sys.exit(2) # List of Devices Devices = [] Alu = {} for opt, arg in opts: if opt == "-h": print(HowToUse % sys.argv[0]) print("\nInformations about devices detected under OpenCL API:") # For PyOpenCL import try: import pyopencl as cl Id = 0 for platform in cl.get_platforms(): for device in platform.get_devices(): # deviceType=cl.device_type.to_string(device.type) deviceType = "xPU" print( "Device #%i from %s of type %s : %s" % ( Id, platform.vendor.lstrip(), deviceType, device.name.lstrip(), ) ) Id = Id + 1 except Exception: print("Your platform does not seem to support OpenCL") print("\nInformations about devices detected under CUDA API:") # For PyCUDA import try: import pycuda.driver as cuda cuda.init() for Id in range(cuda.Device.count()): device = cuda.Device(Id) print("Device #%i of type GPU : %s" % (Id, device.name())) print except Exception: print("Your platform does not seem to support CUDA") sys.exit() elif opt == "-o": OutMetrology = True Metrology = "OutMetro" elif opt == "-c": Curves = True elif opt == "-k": IfThen = True elif opt in ("-d", "--device"): Devices.append(int(arg)) elif opt in ("-g", "--gpustyle"): GpuStyle = arg elif opt in ("-m", "--marsaglia"): RNG = arg elif opt in ("-v", "--valuetype"): ValueType = arg elif opt in ("-i", "--iterations"): Iterations = numpy.uint64(arg) elif opt in ("-b", "--blocksbegin"): BlocksBegin = int(arg) BlocksEnd = BlocksBegin elif opt in ("-e", "--blocksend"): BlocksEnd = int(arg) elif opt in ("-s", "--blocksstep"): BlocksStep = int(arg) elif opt in ("-f", "--threadsfirst"): ThreadsBegin = int(arg) ThreadsEnd = ThreadsBegin elif opt in ("-l", "--threadslast"): ThreadsEnd = int(arg) elif opt in ("-t", "--threadsstep"): ThreadsStep = int(arg) elif opt in ("-r", "--redo"): Redo = int(arg) # If no device has been specified, take the first one! 
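# (Device numbers are flat indices across all platforms' devices, in the order shown by the -h listing.)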
if len(Devices) == 0: Devices.append(0) print("Devices Identification : %s" % Devices) print("GpuStyle used : %s" % GpuStyle) print("Iterations : %s" % Iterations) print("Number of Blocks on begin : %s" % BlocksBegin) print("Number of Blocks on end : %s" % BlocksEnd) print("Step on Blocks : %s" % BlocksStep) print("Number of Threads on begin : %s" % ThreadsBegin) print("Number of Threads on end : %s" % ThreadsEnd) print("Step on Threads : %s" % ThreadsStep) print("Number of redo : %s" % Redo) print("Metrology done out of XPU : %r" % OutMetrology) print("Type of Marsaglia RNG used : %s" % RNG) print("Type of variable : %s" % ValueType) if GpuStyle == "CUDA": try: # For PyCUDA import import pycuda.driver as cuda cuda.init() for Id in range(cuda.Device.count()): device = cuda.Device(Id) print("Device #%i of type GPU : %s" % (Id, device.name())) if Id in Devices: Alu[Id] = "GPU" except ImportError: print("Platform does not seem to support CUDA") if GpuStyle == "OpenCL": try: # For PyOpenCL import import pyopencl as cl Id = 0 for platform in cl.get_platforms(): for device in platform.get_devices(): # deviceType=cl.device_type.to_string(device.type) deviceType = "xPU" print( "Device #%i from %s of type %s : %s" % ( Id, platform.vendor.lstrip().rstrip(), deviceType, device.name.lstrip().rstrip(), ) ) if Id in Devices: # Set the Alu as detected Device Type Alu[Id] = deviceType Id = Id + 1 except ImportError: print("Platform does not seem to support OpenCL") # print(Devices,Alu) BlocksList = range(BlocksBegin, BlocksEnd + BlocksStep, BlocksStep) ThreadsList = range(ThreadsBegin, ThreadsEnd + ThreadsStep, ThreadsStep) ExploredJobs = numpy.array([]).astype(numpy.uint32) ExploredBlocks = numpy.array([]).astype(numpy.uint32) ExploredThreads = numpy.array([]).astype(numpy.uint32) avgD = numpy.array([]).astype(numpy.float32) medD = numpy.array([]).astype(numpy.float32) stdD = numpy.array([]).astype(numpy.float32) minD = numpy.array([]).astype(numpy.float32) maxD = numpy.array([]).astype(numpy.float32) avgR = numpy.array([]).astype(numpy.float32) medR = numpy.array([]).astype(numpy.float32) stdR = numpy.array([]).astype(numpy.float32) minR = numpy.array([]).astype(numpy.float32) maxR = numpy.array([]).astype(numpy.float32) for Blocks, Threads in itertools.product(BlocksList, ThreadsList): # print Blocks,Threads circle = numpy.zeros(Blocks * Threads).astype(numpy.uint64) ExploredJobs = numpy.append(ExploredJobs, Blocks * Threads) ExploredBlocks = numpy.append(ExploredBlocks, Blocks) ExploredThreads = numpy.append(ExploredThreads, Threads) if OutMetrology: DurationItem = numpy.array([]).astype(numpy.float32) Duration = numpy.array([]).astype(numpy.float32) Rate = numpy.array([]).astype(numpy.float32) for i in range(Redo): start = time.time() if GpuStyle == "CUDA": try: InputCU = {} InputCU["Iterations"] = Iterations InputCU["Steps"] = 1 InputCU["Blocks"] = Blocks InputCU["Threads"] = Threads InputCU["Device"] = Devices[0] InputCU["RNG"] = RNG InputCU["Seeds"] = Seeds InputCU["ValueType"] = ValueType InputCU["IfThen"] = IfThen OutputCU = MetropolisCuda(InputCU) Inside = OutputCU["Circle"] NewIterations = OutputCU["NewIterations"] Duration = OutputCU["Duration"] except Exception: print( "Problem with (%i,%i) // computations on Cuda" % (Blocks, Threads) ) elif GpuStyle == "OpenCL": try: InputCL = {} InputCL["Iterations"] = Iterations InputCL["Steps"] = 1 InputCL["Blocks"] = Blocks InputCL["Threads"] = Threads InputCL["Device"] = Devices[0] InputCL["RNG"] = RNG InputCL["Seeds"] = Seeds InputCL["ValueType"] = 
ValueType InputCL["IfThen"] = IfThen OutputCL = MetropolisOpenCL(InputCL) Inside = OutputCL["Circle"] NewIterations = OutputCL["NewIterations"] Duration = OutputCL["Duration"] except Exception: print( "Problem with (%i,%i) // computations on OpenCL" % (Blocks, Threads) ) Duration = numpy.append(Duration, time.time() - start) Rate = numpy.append(Rate, NewIterations / Duration[-1]) else: if GpuStyle == "CUDA": try: InputCU = {} InputCU["Iterations"] = Iterations InputCU["Steps"] = Redo InputCU["Blocks"] = Blocks InputCU["Threads"] = Threads InputCU["Device"] = Devices[0] InputCU["RNG"] = RNG InputCU["Seeds"] = Seeds InputCU["ValueType"] = ValueType InputCU["IfThen"] = IfThen OutputCU = MetropolisCuda(InputCU) Inside = OutputCU["Inside"] NewIterations = OutputCU["NewIterations"] Duration = OutputCU["Duration"] pycuda.context.pop() # noqa: F821 except Exception: print( "Problem with (%i,%i) // computations on Cuda" % (Blocks, Threads) ) elif GpuStyle == "OpenCL": try: InputCL = {} InputCL["Iterations"] = Iterations InputCL["Steps"] = Redo InputCL["Blocks"] = Blocks InputCL["Threads"] = Threads InputCL["Device"] = Devices[0] InputCL["RNG"] = RNG InputCL["Seeds"] = Seeds InputCL["ValueType"] = ValueType InputCL["IfThen"] = IfThen OutputCL = MetropolisOpenCL(InputCL) Inside = OutputCL["Inside"] NewIterations = OutputCL["NewIterations"] Duration = OutputCL["Duration"] except Exception: print( "Problem with (%i,%i) // computations on OpenCL" % (Blocks, Threads) ) Rate = NewIterations / Duration[-1] print( "Itops %i\nLogItops %.2f " % (int(Rate), numpy.log(Rate) / numpy.log(10)) ) print("Pi estimation %.8f" % (4.0 / NewIterations * Inside)) avgD = numpy.append(avgD, numpy.average(Duration)) medD = numpy.append(medD, numpy.median(Duration)) stdD = numpy.append(stdD, numpy.std(Duration)) minD = numpy.append(minD, numpy.min(Duration)) maxD = numpy.append(maxD, numpy.max(Duration)) avgR = numpy.append(avgR, numpy.average(Rate)) medR = numpy.append(medR, numpy.median(Rate)) stdR = numpy.append(stdR, numpy.std(Rate)) minR = numpy.append(minR, numpy.min(Rate)) maxR = numpy.append(maxR, numpy.max(Rate)) print( "%.2f %.2f %.2f %.2f %.2f %i %i %i %i %i" % ( avgD[-1], medD[-1], stdD[-1], minD[-1], maxD[-1], avgR[-1], medR[-1], stdR[-1], minR[-1], maxR[-1], ) ) numpy.savez( "Pi_%s_%s_%s_%s_%s_%s_%s_%s_%.8i_Device%i_%s_%s" % ( ValueType, RNG, Alu[Devices[0]], GpuStyle, BlocksBegin, BlocksEnd, ThreadsBegin, ThreadsEnd, Iterations, Devices[0], Metrology, gethostname(), ), ( ExploredBlocks, ExploredThreads, avgD, medD, stdD, minD, maxD, avgR, medR, stdR, minR, maxR, ), ) ToSave = [ ExploredBlocks, ExploredThreads, avgD, medD, stdD, minD, maxD, avgR, medR, stdR, minR, maxR, ] numpy.savetxt( "Pi_%s_%s_%s_%s_%s_%s_%s_%i_%.8i_Device%i_%s_%s" % ( ValueType, RNG, Alu[Devices[0]], GpuStyle, BlocksBegin, BlocksEnd, ThreadsBegin, ThreadsEnd, Iterations, Devices[0], Metrology, gethostname(), ), numpy.transpose(ToSave), fmt="%i %i %e %e %e %e %e %i %i %i %i %i", ) if Fit: # FIXME: undefined var 'median' FitAndPrint(ExploredJobs, median, Curves) # noqa: F821 pyopencl-2025.1/examples/svm.py0000644000000000000000000000370314332717401013364 0ustar00#!/usr/bin/env python import numpy as np import pyopencl as cl from pyopencl.characterize import ( has_coarse_grain_buffer_svm, has_fine_grain_buffer_svm, has_fine_grain_system_svm, ) ctx = cl.create_some_context() queue = cl.CommandQueue(ctx) dev = queue.device print( f"Device '{dev.name}' on platform '{dev.platform.name} ({dev.platform.version})'" " has the following SVM features:\n" 
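# (Coarse-grain buffer SVM must be mapped/unmapped for host access; fine-grain buffer SVM is host-accessible without mapping; fine-grain system SVM lets kernels use ordinary host memory such as numpy arrays. Each available tier is exercised below.)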
f" Coarse-grained buffer SVM: {has_coarse_grain_buffer_svm(dev)}\n" f" Fine-grained buffer SVM: {has_fine_grain_buffer_svm(dev)}\n" f" Fine-grained system SVM: {has_fine_grain_system_svm(dev)}" ) prg = cl.Program(ctx, """ __kernel void twice( __global float *a_g) { int gid = get_global_id(0); a_g[gid] = 2*a_g[gid]; } """).build() if has_coarse_grain_buffer_svm(dev): print("Testing coarse-grained buffer SVM...", end="") svm_ary = cl.SVM(cl.csvm_empty(ctx, 10, np.float32)) assert isinstance(svm_ary.mem, np.ndarray) with svm_ary.map_rw(queue) as ary: ary.fill(17) # use from host orig_ary = ary.copy() prg.twice(queue, svm_ary.mem.shape, None, svm_ary) queue.finish() with svm_ary.map_ro(queue) as ary: assert np.array_equal(orig_ary*2, ary) print(" done.") if has_fine_grain_buffer_svm(dev): print("Testing fine-grained buffer SVM...", end="") ary = cl.fsvm_empty(ctx, 10, np.float32) assert isinstance(ary.base, cl.SVMAllocation) ary.fill(17) orig_ary = ary.copy() prg.twice(queue, ary.shape, None, cl.SVM(ary)) queue.finish() assert np.array_equal(orig_ary*2, ary) print(" done.") if has_fine_grain_system_svm(dev): print("Testing fine-grained system SVM...", end="") ary = np.zeros(10, np.float32) assert isinstance(ary, np.ndarray) ary.fill(17) orig_ary = ary.copy() prg.twice(queue, ary.shape, None, cl.SVM(ary)) queue.finish() assert np.array_equal(orig_ary*2, ary) print(" done.") pyopencl-2025.1/examples/transpose.py0000644000000000000000000001450514332717401014577 0ustar00# Transposition of a matrix # originally for PyCUDA by Hendrik Riedmann import numpy as np import numpy.linalg as la import pyopencl as cl block_size = 16 class NaiveTranspose: def __init__(self, ctx): self.kernel = ( cl.Program( ctx, """ __kernel void transpose( __global float *a_t, __global float *a, unsigned a_width, unsigned a_height) { int read_idx = get_global_id(0) + get_global_id(1) * a_width; int write_idx = get_global_id(1) + get_global_id(0) * a_height; a_t[write_idx] = a[read_idx]; } """,) .build() .transpose ) def __call__(self, queue, tgt, src, shape): w, h = shape assert w % block_size == 0 assert h % block_size == 0 return self.kernel( queue, (w, h), (block_size, block_size), tgt, src, np.uint32(w), np.uint32(h), ) class SillyTranspose(NaiveTranspose): def __call__(self, queue, tgt, src, shape): w, h = shape assert w % block_size == 0 assert h % block_size == 0 return self.kernel( queue, (w, h), None, tgt, src, np.uint32(w), np.uint32(h) ) class TransposeWithLocal: def __init__(self, ctx): self.kernel = ( cl.Program( ctx, """ #define BLOCK_SIZE %(block_size)d #define A_BLOCK_STRIDE (BLOCK_SIZE * a_width) #define A_T_BLOCK_STRIDE (BLOCK_SIZE * a_height) __kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1))) void transpose( __global float *a_t, __global float *a, unsigned a_width, unsigned a_height, __local float *a_local) { int base_idx_a = get_group_id(0) * BLOCK_SIZE + get_group_id(1) * A_BLOCK_STRIDE; int base_idx_a_t = get_group_id(1) * BLOCK_SIZE + get_group_id(0) * A_T_BLOCK_STRIDE; int glob_idx_a = base_idx_a + get_local_id(0) + a_width * get_local_id(1); int glob_idx_a_t = base_idx_a_t + get_local_id(0) + a_height * get_local_id(1); a_local[get_local_id(1)*BLOCK_SIZE+get_local_id(0)] = a[glob_idx_a]; barrier(CLK_LOCAL_MEM_FENCE); a_t[glob_idx_a_t] = a_local[get_local_id(0)*BLOCK_SIZE+get_local_id(1)]; } """ % {"block_size": block_size}, ) .build() .transpose ) def __call__(self, queue, tgt, src, shape): w, h = shape assert w % block_size == 0 assert h % block_size == 0 return self.kernel( 
queue, (w, h), (block_size, block_size), tgt, src, np.uint32(w), np.uint32(h), cl.LocalMemory(4 * block_size * (block_size + 1)), ) def transpose_using_cl(ctx, queue, cpu_src, cls): mf = cl.mem_flags a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=cpu_src) a_t_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=cpu_src.nbytes) cls(ctx)(queue, a_t_buf, a_buf, cpu_src.shape) w, h = cpu_src.shape result = np.empty((h, w), dtype=cpu_src.dtype) cl.enqueue_copy(queue, result, a_t_buf).wait() a_buf.release() a_t_buf.release() return result def check_transpose(): for cls in [NaiveTranspose, SillyTranspose, TransposeWithLocal]: print("checking", cls.__name__) ctx = cl.create_some_context() for dev in ctx.devices: assert dev.local_mem_size > 0 queue = cl.CommandQueue(ctx) for i in np.arange(10, 13, 0.125): size = int(((2 ** i) // 32) * 32) print(size) rng = np.random.default_rng() source = rng.random((size, size), dtype=np.float32) result = transpose_using_cl(ctx, queue, source, cls) err = source.T - result err_norm = la.norm(err) assert err_norm == 0, (size, err_norm) def benchmark_transpose(): ctx = cl.create_some_context() for dev in ctx.devices: assert dev.local_mem_size > 0 queue = cl.CommandQueue( ctx, properties=cl.command_queue_properties.PROFILING_ENABLE ) sizes = [int(((2 ** i) // 32) * 32) for i in np.arange(10, 13, 0.125)] # for i in np.arange(10, 10.5, 0.125)] mem_bandwidths = {} methods = [SillyTranspose, NaiveTranspose, TransposeWithLocal] for cls in methods: name = cls.__name__.replace("Transpose", "") mem_bandwidths[cls] = meth_mem_bws = [] for size in sizes: rng = np.random.default_rng() source = rng.random((size, size), dtype=np.float32) mf = cl.mem_flags a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=source) a_t_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=source.nbytes) method = cls(ctx) for _i in range(4): method(queue, a_t_buf, a_buf, source.shape) count = 12 events = [] for _i in range(count): events.append(method(queue, a_t_buf, a_buf, source.shape)) events[-1].wait() time = sum(evt.profile.end - evt.profile.start for evt in events) mem_bw = 2 * source.nbytes * count / (time * 1e-9) print("benchmarking", name, size, mem_bw / 1e9, "GB/s") meth_mem_bws.append(mem_bw) a_buf.release() a_t_buf.release() try: from matplotlib.pyplot import clf, grid, legend, plot, savefig, xlabel, ylabel except ModuleNotFoundError: pass else: for i in range(len(methods)): clf() for j in range(i + 1): method = methods[j] name = method.__name__.replace("Transpose", "") plot(sizes, np.array(mem_bandwidths[method]) / 1e9, "o-", label=name) xlabel("Matrix width/height $N$") ylabel("Memory Bandwidth [GB/s]") legend(loc="best") grid() savefig("transpose-benchmark-%d.pdf" % i) check_transpose() benchmark_transpose() pyopencl-2025.1/pyopencl/__init__.py0000644000000000000000000023704414332717401014337 0ustar00from __future__ import annotations __copyright__ = "Copyright (C) 2009-15 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import logging from sys import intern from typing import Any, Sequence from warnings import warn # must import, otherwise dtype registry will not be fully populated import pyopencl.cltypes from pyopencl.version import VERSION, VERSION_STATUS, VERSION_TEXT # noqa: F401 __version__ = VERSION_TEXT logger = logging.getLogger(__name__) # This supports ocl-icd find shipped OpenCL ICDs, cf. # https://github.com/isuruf/ocl-icd/commit/3862386b51930f95d9ad1089f7157a98165d5a6b # via # https://github.com/inducer/pyopencl/blob/0b3d0ef92497e6838eea300b974f385f94cb5100/scripts/build-wheels.sh#L43-L44 import os os.environ["PYOPENCL_HOME"] = os.path.dirname(os.path.abspath(__file__)) try: import pyopencl._cl as _cl except ImportError: from os.path import dirname, join, realpath if realpath(join(os.getcwd(), "pyopencl")) == realpath(dirname(__file__)): warn( "It looks like you are importing PyOpenCL from " "its source directory. This likely won't work.", stacklevel=2) raise import numpy as np import sys _PYPY = "__pypy__" in sys.builtin_module_names from pyopencl._cl import ( # noqa: F401 get_cl_header_version, program_kind, status_code, platform_info, device_type, device_info, device_topology_type_amd, device_fp_config, device_mem_cache_type, device_local_mem_type, device_exec_capabilities, device_svm_capabilities, command_queue_properties, context_info, gl_context_info, context_properties, command_queue_info, queue_properties, mem_flags, svm_mem_flags, channel_order, channel_type, mem_object_type, mem_info, image_info, pipe_info, pipe_properties, addressing_mode, filter_mode, sampler_info, sampler_properties, map_flags, program_info, program_build_info, program_binary_type, kernel_info, kernel_arg_info, kernel_arg_address_qualifier, kernel_arg_access_qualifier, kernel_arg_type_qualifier, kernel_work_group_info, kernel_sub_group_info, event_info, command_type, command_execution_status, profiling_info, mem_migration_flags, device_partition_property, device_affinity_domain, device_atomic_capabilities, device_device_enqueue_capabilities, version_bits, khronos_vendor_id, Error, MemoryError, LogicError, RuntimeError, Platform, get_platforms, Device, Context, CommandQueue, LocalMemory, MemoryObjectHolder, MemoryObject, MemoryMap, Buffer, _Program, Kernel, Event, wait_for_events, NannyEvent, enqueue_nd_range_kernel, _enqueue_marker, _enqueue_read_buffer, _enqueue_write_buffer, _enqueue_copy_buffer, _enqueue_read_buffer_rect, _enqueue_write_buffer_rect, _enqueue_copy_buffer_rect, _enqueue_read_image, _enqueue_copy_image, _enqueue_write_image, _enqueue_copy_image_to_buffer, _enqueue_copy_buffer_to_image, have_gl, ImageFormat, get_supported_image_formats, Image, Sampler, # This class is available unconditionally, even though CL only # has it on CL2.0 and newer. 
Pipe, ) try: from pyopencl._cl import DeviceTopologyAmd # noqa: F401 from pyopencl._cl import enqueue_copy_buffer_p2p_amd # noqa: F401 except ImportError: pass if not _PYPY: # FIXME: Add back to default set when pypy support catches up from pyopencl._cl import enqueue_map_buffer # noqa: F401 from pyopencl._cl import enqueue_map_image # noqa: F401 if get_cl_header_version() >= (1, 1): from pyopencl._cl import UserEvent # noqa: F401 if get_cl_header_version() >= (1, 2): from pyopencl._cl import ImageDescriptor from pyopencl._cl import ( # noqa: F401 _enqueue_barrier_with_wait_list, _enqueue_fill_buffer, _enqueue_marker_with_wait_list, enqueue_fill_image, enqueue_migrate_mem_objects, unload_platform_compiler) if get_cl_header_version() >= (2, 0): from pyopencl._cl import SVM, SVMAllocation, SVMPointer if _cl.have_gl(): from pyopencl._cl import ( # noqa: F401 GLBuffer, GLRenderBuffer, GLTexture, gl_object_type, gl_texture_info) try: from pyopencl._cl import get_apple_cgl_share_group # noqa: F401 except ImportError: pass try: from pyopencl._cl import enqueue_acquire_gl_objects # noqa: F401 from pyopencl._cl import enqueue_release_gl_objects # noqa: F401 except ImportError: pass import inspect as _inspect CONSTANT_CLASSES = tuple( getattr(_cl, name) for name in dir(_cl) if _inspect.isclass(getattr(_cl, name)) and name[0].islower() and name not in ["zip", "map", "range"]) BITFIELD_CONSTANT_CLASSES = ( _cl.device_type, _cl.device_fp_config, _cl.device_exec_capabilities, _cl.command_queue_properties, _cl.mem_flags, _cl.map_flags, _cl.kernel_arg_type_qualifier, _cl.device_affinity_domain, _cl.mem_migration_flags, _cl.device_svm_capabilities, _cl.queue_properties, _cl.svm_mem_flags, _cl.device_atomic_capabilities, _cl.device_device_enqueue_capabilities, _cl.version_bits, ) # {{{ diagnostics class CompilerWarning(UserWarning): pass class CommandQueueUsedAfterExit(UserWarning): pass def compiler_output(text: str) -> None: from pytools import strtobool if strtobool(os.environ.get("PYOPENCL_COMPILER_OUTPUT", "False")): warn(text, CompilerWarning, stacklevel=3) else: warn("Non-empty compiler output encountered. Set the " "environment variable PYOPENCL_COMPILER_OUTPUT=1 " "to see more.", CompilerWarning, stacklevel=3) # }}} # {{{ find pyopencl shipped source code def _find_pyopencl_include_path() -> str: from os.path import abspath, dirname, exists, join # Try to find the include path in the same directory as this file include_path = join(abspath(dirname(__file__)), "cl") if not exists(include_path): try: # NOTE: only available in Python >=3.9 from importlib.resources import files except ImportError: from importlib_resources import files # type: ignore[no-redef] include_path = str(files("pyopencl") / "cl") if not exists(include_path): raise OSError("Unable to find PyOpenCL include path") # Quote the path if it contains a space and is not quoted already. # See https://github.com/inducer/pyopencl/issues/250 for discussion. 
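    # For illustration (hypothetical path, not from this codebase): an include
    # path such as
    #   /opt/my sdk/cl
    # would otherwise be split at the space when build options are joined and
    # re-split, reaching the compiler as "-I /opt/my" plus a stray "sdk/cl",
    # so it is returned as '"/opt/my sdk/cl"'.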
if " " in include_path and not include_path.startswith('"'): return '"' + include_path + '"' else: return include_path # }}} # {{{ build option munging def _split_options_if_necessary(options): if isinstance(options, str): import shlex options = shlex.split(options) return options def _find_include_path(options): def unquote(path): if path.startswith('"') and path.endswith('"'): return path[1:-1] else: return path include_path = ["."] option_idx = 0 while option_idx < len(options): option = options[option_idx].strip() if option.startswith("-I") or option.startswith("/I"): if len(option) == 2: if option_idx+1 < len(options): include_path.append(unquote(options[option_idx+1])) option_idx += 2 else: include_path.append(unquote(option[2:].lstrip())) option_idx += 1 else: option_idx += 1 # }}} return include_path def _options_to_bytestring(options): def encode_if_necessary(s): if isinstance(s, str): return s.encode("utf-8") else: return s return b" ".join(encode_if_necessary(s) for s in options) # }}} # {{{ Program (wrapper around _Program, adds caching support) from pytools import strtobool _PYOPENCL_NO_CACHE = strtobool(os.environ.get("PYOPENCL_NO_CACHE", "false")) _DEFAULT_BUILD_OPTIONS: list[str] = [] _DEFAULT_INCLUDE_OPTIONS: list[str] = ["-I", _find_pyopencl_include_path()] # map of platform.name to build options list _PLAT_BUILD_OPTIONS: dict[str, list[str]] = { "Oclgrind": ["-D", "PYOPENCL_USING_OCLGRIND"], } def enable_debugging(platform_or_context): """Enables debugging for all code subsequently compiled by PyOpenCL on the passed *platform*. Alternatively, a context may be passed. """ if isinstance(platform_or_context, Context): platform = platform_or_context.devices[0].platform else: platform = platform_or_context if "AMD Accelerated" in platform.name: _PLAT_BUILD_OPTIONS.setdefault(platform.name, []).extend( ["-g", "-O0"]) os.environ["CPU_MAX_COMPUTE_UNITS"] = "1" else: warn(f"Do not know how to enable debugging on '{platform.name}'", stacklevel=2) class Program: def __init__(self, arg1, arg2=None, arg3=None): if arg2 is None: # 1-argument form: program self._prg = arg1 self._context = self._prg.get_info(program_info.CONTEXT) elif arg3 is None: # 2-argument form: context, source context, source = arg1, arg2 from pyopencl.tools import is_spirv if is_spirv(source): # FIXME no caching in SPIR-V case self._context = context self._prg = _cl._create_program_with_il(context, source) return self._context = context self._source = source self._prg = None else: context, device, binaries = arg1, arg2, arg3 self._context = context self._prg = _cl._Program(context, device, binaries) self._build_duration_info = None def _get_prg(self): if self._prg is not None: return self._prg else: # "no program" can only happen in from-source case. 
warn("Pre-build attribute access defeats compiler caching.", stacklevel=3) self._prg = _cl._Program(self._context, self._source) return self._prg def get_info(self, arg): return self._get_prg().get_info(arg) def get_build_info(self, *args, **kwargs): return self._get_prg().get_build_info(*args, **kwargs) def all_kernels(self): return self._get_prg().all_kernels() @property def int_ptr(self): return self._get_prg().int_ptr int_ptr.__doc__ = _cl._Program.int_ptr.__doc__ @staticmethod def from_int_ptr(int_ptr_value, retain=True): return Program(_cl._Program.from_int_ptr(int_ptr_value, retain)) from_int_ptr.__doc__ = _cl._Program.from_int_ptr.__doc__ def __getattr__(self, attr): try: knl = Kernel(self, attr) # Nvidia does not raise errors even for invalid names, # but this will give an error if the kernel is invalid. knl.num_args # noqa: B018 if self._build_duration_info is not None: build_descr, _was_cached, duration = self._build_duration_info if duration > 0.2: logger.info( "build program: kernel '%s' was part of a " "lengthy %s (%.2f s)", attr, build_descr, duration) # don't whine about build times more than once. self._build_duration_info = None return knl except LogicError as err: raise AttributeError("'%s' was not found as a program " "info attribute or as a kernel name" % attr) from err # {{{ build @classmethod def _process_build_options(cls, context, options, _add_include_path=False): if options is None: options = [] if isinstance(options, tuple): options = list(options) options = _split_options_if_necessary(options) options = (options + _DEFAULT_BUILD_OPTIONS + _DEFAULT_INCLUDE_OPTIONS + _PLAT_BUILD_OPTIONS.get( context.devices[0].platform.name, [])) forced_options = os.environ.get("PYOPENCL_BUILD_OPTIONS") if forced_options: options = options + forced_options.split() return ( _options_to_bytestring(options), _find_include_path(options)) def build(self, options=None, devices=None, cache_dir=None): options_bytes, include_path = self._process_build_options( self._context, options) if cache_dir is None: cache_dir = getattr(self._context, "cache_dir", None) build_descr = None from pyopencl.characterize import has_src_build_cache if ( (_PYOPENCL_NO_CACHE or has_src_build_cache(self._context.devices[0])) and self._prg is None): if _PYOPENCL_NO_CACHE: build_descr = "uncached source build (cache disabled by user)" else: build_descr = "uncached source build (assuming cached by ICD)" self._prg = _cl._Program(self._context, self._source) from time import time start_time = time() was_cached = False if self._prg is not None: # uncached if build_descr is None: build_descr = "uncached source build" self._build_and_catch_errors( lambda: self._prg.build(options_bytes, devices), options_bytes=options_bytes) else: # cached from pyopencl.cache import create_built_program_from_source_cached self._prg, was_cached = self._build_and_catch_errors( lambda: create_built_program_from_source_cached( self._context, self._source, options_bytes, devices, cache_dir=cache_dir, include_path=include_path), options_bytes=options_bytes, source=self._source) if was_cached: build_descr = "cache retrieval" else: build_descr = "source build resulting from a binary cache miss" del self._context end_time = time() self._build_duration_info = (build_descr, was_cached, end_time-start_time) return self def _build_and_catch_errors(self, build_func, options_bytes, source=None): try: return build_func() except RuntimeError as e: msg = str(e) if options_bytes: msg = msg + "\n(options: %s)" % options_bytes.decode("utf-8") if source is 
not None: from tempfile import NamedTemporaryFile srcfile = NamedTemporaryFile(mode="wt", delete=False, suffix=".cl") try: srcfile.write(source) finally: srcfile.close() msg = msg + "\n(source saved as %s)" % srcfile.name code = e.code routine = e.routine err = RuntimeError( _cl._ErrorRecord( msg=msg, code=code, routine=routine)) # Python 3.2 outputs the whole list of currently active exceptions # This serves to remove one (redundant) level from that nesting. raise err # }}} def compile(self, options=None, devices=None, headers=None): if headers is None: headers = [] options_bytes, _ = self._process_build_options(self._context, options) self._get_prg().compile(options_bytes, devices, [(name, prg._get_prg()) for name, prg in headers]) return self def __eq__(self, other): return self._get_prg() == other._get_prg() def __ne__(self, other): return self._get_prg() != other._get_prg() def __hash__(self): return hash(self._get_prg()) def create_program_with_built_in_kernels(context, devices, kernel_names): if not isinstance(kernel_names, str): kernel_names = ":".join(kernel_names) return Program(_Program.create_with_built_in_kernels( context, devices, kernel_names)) def link_program(context, programs, options=None, devices=None): if options is None: options = [] options_bytes = _options_to_bytestring(_split_options_if_necessary(options)) programs = [prg._get_prg() for prg in programs] raw_prg = _Program.link(context, programs, options_bytes, devices) return Program(raw_prg) # }}} # {{{ monkeypatch C++ wrappers to add functionality def _add_functionality(): def generic_get_cl_version(self): import re version_string = self.version match = re.match(r"^OpenCL ([0-9]+)\.([0-9]+) .*$", version_string) if match is None: raise RuntimeError("%s %s returned non-conformant " "platform version string '%s'" % (type(self).__name__, self, version_string)) return int(match.group(1)), int(match.group(2)) # {{{ Platform def platform_repr(self): return f"<pyopencl.Platform '{self.name}' at 0x{self.int_ptr:x}>" Platform.__repr__ = platform_repr Platform._get_cl_version = generic_get_cl_version # }}} # {{{ Device def device_repr(self): return "<pyopencl.Device '{}' on '{}' at 0x{:x}>".format( self.name.strip(), self.platform.name.strip(), self.int_ptr) def device_hashable_model_and_version_identifier(self): return ("v1", self.vendor, self.vendor_id, self.name, self.version) def device_persistent_unique_id(self): warn("Device.persistent_unique_id is deprecated.
" "Use Device.hashable_model_and_version_identifier instead.", DeprecationWarning, stacklevel=2) return device_hashable_model_and_version_identifier(self) Device.__repr__ = device_repr # undocumented for now: Device._get_cl_version = generic_get_cl_version Device.hashable_model_and_version_identifier = property( device_hashable_model_and_version_identifier) Device.persistent_unique_id = property(device_persistent_unique_id) # }}} # {{{ Context def context_repr(self): return "".format(self.int_ptr, ", ".join(repr(dev) for dev in self.devices)) def context_get_cl_version(self): return self.devices[0].platform._get_cl_version() Context.__repr__ = context_repr from pytools import memoize_method Context._get_cl_version = memoize_method(context_get_cl_version) # }}} # {{{ CommandQueue def command_queue_enter(self): return self def command_queue_exit(self, exc_type, exc_val, exc_tb): self.finish() self._finalize() def command_queue_get_cl_version(self): return self.device._get_cl_version() CommandQueue.__enter__ = command_queue_enter CommandQueue.__exit__ = command_queue_exit CommandQueue._get_cl_version = memoize_method(command_queue_get_cl_version) # }}} # {{{ _Program (the internal, non-caching version) def program_get_build_logs(self): build_logs = [] for dev in self.get_info(_cl.program_info.DEVICES): try: log = self.get_build_info(dev, program_build_info.LOG) except Exception: log = "" build_logs.append((dev, log)) return build_logs def program_build(self, options_bytes, devices=None): err = None try: self._build(options=options_bytes, devices=devices) except Error as e: msg = str(e) + "\n\n" + (75*"="+"\n").join( f"Build on {dev}:\n\n{log}" for dev, log in self._get_build_logs()) code = e.code routine = e.routine err = _cl.RuntimeError( _cl._ErrorRecord( msg=msg, code=code, routine=routine)) if err is not None: # Python 3.2 outputs the whole list of currently active exceptions # This serves to remove one (redundant) level from that nesting. 
raise err message = (75*"="+"\n").join( f"Build on {dev} succeeded, but said:\n\n{log}" for dev, log in self._get_build_logs() if log is not None and log.strip()) if message: if self.kind() == program_kind.SOURCE: build_type = "From-source build" elif self.kind() == program_kind.BINARY: build_type = "From-binary build" elif self.kind() == program_kind.IL: build_type = "From-IL build" else: build_type = "Build" compiler_output("%s succeeded, but resulted in non-empty logs:\n%s" % (build_type, message)) return self _cl._Program._get_build_logs = program_get_build_logs _cl._Program.build = program_build # }}} # {{{ Event class ProfilingInfoGetter: def __init__(self, event): self.event = event def __getattr__(self, name): info_cls = _cl.profiling_info try: inf_attr = getattr(info_cls, name.upper()) except AttributeError as err: raise AttributeError("%s has no attribute '%s'" % (type(self), name)) from err else: return self.event.get_profiling_info(inf_attr) _cl.Event.profile = property(ProfilingInfoGetter) # }}} # {{{ Kernel kernel_old_get_info = Kernel.get_info kernel_old_get_work_group_info = Kernel.get_work_group_info def kernel_set_arg_types(self, arg_types): arg_types = tuple(arg_types) # {{{ arg counting bug handling # For example: # https://github.com/pocl/pocl/issues/197 # (but Apple CPU has a similar bug) work_around_arg_count_bug = False warn_about_arg_count_bug = False from pyopencl.characterize import has_struct_arg_count_bug count_bug_per_dev = [ has_struct_arg_count_bug(dev, self.context) for dev in self.context.devices] from pytools import single_valued if any(count_bug_per_dev): if all(count_bug_per_dev): work_around_arg_count_bug = single_valued(count_bug_per_dev) else: warn_about_arg_count_bug = True # }}} from pyopencl.invoker import generate_enqueue_and_set_args self._set_enqueue_and_set_args( *generate_enqueue_and_set_args( self.function_name, len(arg_types), self.num_args, arg_types, warn_about_arg_count_bug=warn_about_arg_count_bug, work_around_arg_count_bug=work_around_arg_count_bug, devs=self.context.devices)) def kernel_get_work_group_info(self, param, device): try: wg_info_cache = self._wg_info_cache except AttributeError: wg_info_cache = self._wg_info_cache = {} cache_key = (param, device.int_ptr) try: return wg_info_cache[cache_key] except KeyError: pass result = kernel_old_get_work_group_info(self, param, device) wg_info_cache[cache_key] = result return result def kernel_capture_call(self, output_file, queue, global_size, local_size, *args, **kwargs): from pyopencl.capture_call import capture_kernel_call capture_kernel_call(self, output_file, queue, global_size, local_size, *args, **kwargs) def kernel_get_info(self, param_name): val = kernel_old_get_info(self, param_name) if isinstance(val, _Program): return Program(val) else: return val Kernel.get_work_group_info = kernel_get_work_group_info # FIXME: Possibly deprecate this version Kernel.set_scalar_arg_dtypes = kernel_set_arg_types Kernel.set_arg_types = kernel_set_arg_types Kernel.capture_call = kernel_capture_call Kernel.get_info = kernel_get_info # }}} # {{{ ImageFormat def image_format_repr(self): return "ImageFormat({}, {})".format( channel_order.to_string(self.channel_order, ""), channel_type.to_string(self.channel_data_type, "")) def image_format_eq(self, other): return (self.channel_order == other.channel_order and self.channel_data_type == other.channel_data_type) def image_format_ne(self, other): return not image_format_eq(self, other) def image_format_hash(self): return hash((type(self), 
self.channel_order, self.channel_data_type)) ImageFormat.__repr__ = image_format_repr ImageFormat.__eq__ = image_format_eq ImageFormat.__ne__ = image_format_ne ImageFormat.__hash__ = image_format_hash # }}} # {{{ Image def image_init( self, context, flags, format, shape=None, pitches=None, hostbuf=None, is_array=False, buffer=None, *, desc: ImageDescriptor | None = None, _through_create_image: bool = False, ) -> None: if hostbuf is not None and not \ (flags & (mem_flags.USE_HOST_PTR | mem_flags.COPY_HOST_PTR)): warn("'hostbuf' was passed, but no memory flags to make use of it.", stacklevel=2) if desc is not None: if shape is not None: raise TypeError("shape may not be passed when using descriptor") if pitches is not None: raise TypeError("pitches may not be passed when using descriptor") if is_array: raise TypeError("is_array may not be passed when using descriptor") if buffer is not None: raise TypeError("is_array may not be passed when using descriptor") Image._custom_init(self, context, flags, format, desc, hostbuf) return if shape is None and hostbuf is None: raise Error("'shape' must be passed if 'hostbuf' is not given") if shape is None and hostbuf is not None: shape = hostbuf.shape if hostbuf is None and pitches is not None: raise Error("'pitches' may only be given if 'hostbuf' is given") if context._get_cl_version() >= (1, 2) and get_cl_header_version() >= (1, 2): if not _through_create_image: warn("Non-descriptor Image constructor called. " "This will stop working in 2026. " "Use create_image instead (with the same arguments).", DeprecationWarning, stacklevel=2) if buffer is not None and is_array: raise ValueError( "'buffer' and 'is_array' are mutually exclusive") if len(shape) == 3: if buffer is not None: raise TypeError( "'buffer' argument is not supported for 3D arrays") elif is_array: image_type = mem_object_type.IMAGE2D_ARRAY else: image_type = mem_object_type.IMAGE3D elif len(shape) == 2: if buffer is not None: raise TypeError( "'buffer' argument is not supported for 2D arrays") elif is_array: image_type = mem_object_type.IMAGE1D_ARRAY else: image_type = mem_object_type.IMAGE2D elif len(shape) == 1: if buffer is not None: image_type = mem_object_type.IMAGE1D_BUFFER elif is_array: raise TypeError("array of zero-dimensional images not supported") else: image_type = mem_object_type.IMAGE1D else: raise ValueError("images cannot have more than three dimensions") desc = ImageDescriptor() \ # pylint: disable=possibly-used-before-assignment desc.image_type = image_type desc.shape = shape # also sets desc.array_size if pitches is None: desc.pitches = (0, 0) else: desc.pitches = pitches desc.num_mip_levels = 0 # per CL 1.2 spec desc.num_samples = 0 # per CL 1.2 spec desc.buffer = buffer Image._custom_init(self, context, flags, format, desc, hostbuf) else: # legacy init for CL 1.1 and older if is_array: raise TypeError("'is_array=True' is not supported for CL < 1.2") # if num_mip_levels is not None: # raise TypeError( # "'num_mip_levels' argument is not supported for CL < 1.2") # if num_samples is not None: # raise TypeError( # "'num_samples' argument is not supported for CL < 1.2") if buffer is not None: raise TypeError("'buffer' argument is not supported for CL < 1.2") Image._custom_init(self, context, flags, format, shape, pitches, hostbuf) class _ImageInfoGetter: def __init__(self, event): warn( "Image.image.attr is deprecated and will go away in 2021. 
" "Use Image.attr directly, instead.", stacklevel=2) self.event = event def __getattr__(self, name): try: inf_attr = getattr(_cl.image_info, name.upper()) except AttributeError as err: raise AttributeError("%s has no attribute '%s'" % (type(self), name)) from err else: return self.event.get_image_info(inf_attr) def image_shape(self): if self.type == mem_object_type.IMAGE2D: return (self.width, self.height) elif self.type == mem_object_type.IMAGE3D: return (self.width, self.height, self.depth) else: raise LogicError("only images have shapes") Image.__init__ = image_init Image.image = property(_ImageInfoGetter) Image.shape = property(image_shape) # }}} # {{{ Error def error_str(self): val = self.what try: val.routine # noqa: B018 except AttributeError: return str(val) else: result = "" if val.code() != status_code.SUCCESS: result = status_code.to_string( val.code(), "") routine = val.routine() if routine: result = f"{routine} failed: {result}" what = val.what() if what: if result: result += " - " result += what return result def error_code(self): return self.args[0].code() def error_routine(self): return self.args[0].routine() def error_what(self): return self.args[0] Error.__str__ = error_str Error.code = property(error_code) Error.routine = property(error_routine) Error.what = property(error_what) # }}} # {{{ MemoryMap def memory_map_enter(self): return self def memory_map_exit(self, exc_type, exc_val, exc_tb): self.release() MemoryMap.__doc__ = """ This class may also be used as a context manager in a ``with`` statement. The memory corresponding to this object will be unmapped when this object is deleted or :meth:`release` is called. .. automethod:: release """ MemoryMap.__enter__ = memory_map_enter MemoryMap.__exit__ = memory_map_exit # }}} # {{{ SVMPointer if get_cl_header_version() >= (2, 0): SVMPointer.__doc__ = """A base class for things that can be passed to functions that allow an SVM pointer, e.g. kernel enqueues and memory copies. Objects of this type cannot currently be directly created or implemented in Python. To obtain objects implementing this type, consider its subtypes :class:`SVMAllocation` and :class:`SVM`. .. property:: svm_ptr Gives the SVM pointer as an :class:`int`. .. property:: size An :class:`int` denoting the size in bytes, or *None*, if the size of the SVM pointed to is not known. *Most* objects of this type (e.g. instances of :class:`SVMAllocation` and :class:`SVM` know their size, so that, for example :class:`enqueue_copy` will automatically copy an entire :class:`SVMAllocation` when a size is not explicitly specified. .. automethod:: map .. automethod:: map_ro .. automethod:: map_rw .. automethod:: as_buffer .. property:: buf An opaque object implementing the :c:func:`Python buffer protocol `. It exposes the pointed-to memory as a one-dimensional buffer of bytes, with the size matching :attr:`size`. No guarantee is provided that two references to this attribute result in the same object. """ def svmptr_map(self, queue: CommandQueue, *, flags: int, is_blocking: bool = True, wait_for: Sequence[Event] | None = None, size: Event | None = None) -> SVMMap: """ :arg is_blocking: If *False*, subsequent code must wait on :attr:`SVMMap.event` in the returned object before accessing the mapped memory. :arg flags: a combination of :class:`pyopencl.map_flags`. :arg size: The size of the map in bytes. If not provided, defaults to :attr:`size`. 
|std-enqueue-blurb| """ return SVMMap(self, np.asarray(self.buf), queue, _cl._enqueue_svm_map(queue, is_blocking, flags, self, wait_for, size=size)) def svmptr_map_ro(self, queue: CommandQueue, *, is_blocking: bool = True, wait_for: Sequence[Event] | None = None, size: int | None = None) -> SVMMap: """Like :meth:`map`, but with *flags* set for a read-only map. """ return self.map(queue, flags=map_flags.READ, is_blocking=is_blocking, wait_for=wait_for, size=size) def svmptr_map_rw(self, queue: CommandQueue, *, is_blocking: bool = True, wait_for: Sequence[Event] | None = None, size: int | None = None) -> SVMMap: """Like :meth:`map`, but with *flags* set for a read-only map. """ return self.map(queue, flags=map_flags.READ | map_flags.WRITE, is_blocking=is_blocking, wait_for=wait_for, size=size) def svmptr__enqueue_unmap(self, queue, wait_for=None): return _cl._enqueue_svm_unmap(queue, self, wait_for) def svmptr_as_buffer(self, ctx: Context, *, flags: int | None = None, size: int | None = None) -> Buffer: """ :arg ctx: a :class:`Context` :arg flags: a combination of :class:`pyopencl.map_flags`, defaults to read-write. :arg size: The size of the map in bytes. If not provided, defaults to :attr:`size`. :returns: a :class:`Buffer` corresponding to *self*. The memory referred to by this object must not be freed before the returned :class:`Buffer` is released. """ if flags is None: flags = mem_flags.READ_WRITE | mem_flags.USE_HOST_PTR if size is None: size = self.size return Buffer(ctx, flags, size=size, hostbuf=self.buf) if get_cl_header_version() >= (2, 0): SVMPointer.map = svmptr_map SVMPointer.map_ro = svmptr_map_ro SVMPointer.map_rw = svmptr_map_rw SVMPointer._enqueue_unmap = svmptr__enqueue_unmap SVMPointer.as_buffer = svmptr_as_buffer # }}} # {{{ SVMAllocation if get_cl_header_version() >= (2, 0): SVMAllocation.__doc__ = """ Is a :class:`SVMPointer`. .. versionadded:: 2016.2 .. automethod:: __init__ :arg flags: See :class:`svm_mem_flags`. :arg queue: If not specified, the allocation will be freed eagerly, irrespective of whether pending/enqueued operations are still using this memory. If specified, deallocation of the memory will be enqueued with the given queue, and will only be performed after previously-enqueue operations in the queue have completed. It is an error to specify an out-of-order queue. .. warning:: Not specifying a queue will typically lead to undesired behavior, including crashes and memory corruption. See the warning in :ref:`svm`. .. automethod:: enqueue_release Enqueue the release of this allocation into *queue*. If *queue* is not specified, enqueue the deallocation into the queue provided at allocation time or via :class:`bind_to_queue`. .. automethod:: bind_to_queue Change the queue used for implicit enqueue of deallocation to *queue*. Sufficient synchronization is ensured by enqueuing a marker into the old queue and waiting on this marker in the new queue. .. automethod:: unbind_from_queue Configure the allocation to no longer implicitly enqueue memory allocation. If such a queue was previously provided, :meth:`~CommandQueue.finish` is automatically called on it. """ # }}} # {{{ SVM if get_cl_header_version() >= (2, 0): SVM.__doc__ = """Tags an object exhibiting the Python buffer interface (such as a :class:`numpy.ndarray`) as referring to shared virtual memory. Is a :class:`SVMPointer`, hence objects of this type may be passed to kernel calls and :func:`enqueue_copy`, and all methods declared there are also available there. 
Note that :meth:`map` differs slightly from :meth:`SVMPointer.map`. Depending on the features of the OpenCL implementation, the following types of objects may be passed to/wrapped in this type: * fine-grain shared memory as returned by (e.g.) :func:`fsvm_empty`, if the implementation supports fine-grained shared virtual memory. This memory may directly be passed to a kernel:: ary = cl.fsvm_empty(ctx, 1000, np.float32) assert isinstance(ary, np.ndarray) prg.twice(queue, ary.shape, None, cl.SVM(ary)) queue.finish() # synchronize print(ary) # access from host Observe how mapping (as needed in coarse-grain SVM) is no longer necessary. * any :class:`numpy.ndarray` (or other Python object with a buffer interface) if the implementation supports fine-grained *system* shared virtual memory. This is how plain :mod:`numpy` arrays may directly be passed to a kernel:: ary = np.zeros(1000, np.float32) prg.twice(queue, ary.shape, None, cl.SVM(ary)) queue.finish() # synchronize print(ary) # access from host * coarse-grain shared memory as returned by (e.g.) :func:`csvm_empty` for any implementation of OpenCL 2.0. .. note:: Applications making use of coarse-grain SVM may be better served by opaque-style SVM. See :ref:`opaque-svm`. This is how coarse-grain SVM may be used from both host and device:: svm_ary = cl.SVM( cl.csvm_empty(ctx, 1000, np.float32, alignment=64)) assert isinstance(svm_ary.mem, np.ndarray) with svm_ary.map_rw(queue) as ary: ary.fill(17) # use from host prg.twice(queue, svm_ary.mem.shape, None, svm_ary) Coarse-grain shared-memory *must* be mapped into host address space using :meth:`~SVMPointer.map` before being accessed through the :mod:`numpy` interface. .. note:: This object merely serves as a 'tag' that changes the behavior of functions to which it is passed. It has no special management relationship to the memory it tags. For example, it is permissible to grab a :class:`numpy.ndarray` out of :attr:`SVM.mem` of one :class:`SVM` instance and use the array to construct another. Neither of the tags need to be kept alive. .. versionadded:: 2016.2 .. attribute:: mem The wrapped object. .. automethod:: __init__ .. automethod:: map .. automethod:: map_ro .. automethod:: map_rw """ # }}} def svm_map(self, queue, flags, is_blocking=True, wait_for=None): """ :arg is_blocking: If *False*, subsequent code must wait on :attr:`SVMMap.event` in the returned object before accessing the mapped memory. :arg flags: a combination of :class:`pyopencl.map_flags`. :returns: an :class:`SVMMap` instance This differs from the inherited :class:`SVMPointer.map` in that no size can be specified, and that :attr:`mem` is the exact array produced when the :class:`SVMMap` is used as a context manager. 
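    For example (a sketch; *ary* is assumed to be coarse-grain SVM, as
    returned by :func:`csvm_empty`, and ``import pyopencl as cl`` is in
    scope)::

        svm = cl.SVM(ary)
        with svm.map(queue, cl.map_flags.READ | cl.map_flags.WRITE) as host_ary:
            host_ary[0] = 17  # *host_ary* is *ary* itself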
|std-enqueue-blurb| """ return SVMMap( self, self.mem, queue, _cl._enqueue_svm_map(queue, is_blocking, flags, self, wait_for)) def svm_map_ro(self, queue, is_blocking=True, wait_for=None): """Like :meth:`map`, but with *flags* set for a read-only map.""" return self.map(queue, map_flags.READ, is_blocking=is_blocking, wait_for=wait_for) def svm_map_rw(self, queue, is_blocking=True, wait_for=None): """Like :meth:`map`, but with *flags* set for a read-only map.""" return self.map(queue, map_flags.READ | map_flags.WRITE, is_blocking=is_blocking, wait_for=wait_for) def svm__enqueue_unmap(self, queue, wait_for=None): return _cl._enqueue_svm_unmap(queue, self, wait_for) if get_cl_header_version() >= (2, 0): SVM.map = svm_map SVM.map_ro = svm_map_ro SVM.map_rw = svm_map_rw SVM._enqueue_unmap = svm__enqueue_unmap # }}} # ORDER DEPENDENCY: Some of the above may override get_info, the effect needs # to be visible through the attributes. So get_info attr creation needs to happen # after the overriding is complete. cls_to_info_cls = { _cl.Platform: (_cl.Platform.get_info, _cl.platform_info, []), _cl.Device: (_cl.Device.get_info, _cl.device_info, ["PLATFORM", "MAX_WORK_GROUP_SIZE", "MAX_COMPUTE_UNITS"]), _cl.Context: (_cl.Context.get_info, _cl.context_info, []), _cl.CommandQueue: (_cl.CommandQueue.get_info, _cl.command_queue_info, ["CONTEXT", "DEVICE"]), _cl.Event: (_cl.Event.get_info, _cl.event_info, []), _cl.MemoryObjectHolder: (MemoryObjectHolder.get_info, _cl.mem_info, []), Image: (_cl.Image.get_image_info, _cl.image_info, []), Pipe: (_cl.Pipe.get_pipe_info, _cl.pipe_info, []), Program: (Program.get_info, _cl.program_info, []), Kernel: (Kernel.get_info, _cl.kernel_info, []), _cl.Sampler: (Sampler.get_info, _cl.sampler_info, []), } def to_string(cls, value, default_format=None): if cls._is_bitfield: names = [] for name in dir(cls): attr = getattr(cls, name) if not isinstance(attr, int): continue if attr == value or attr & value: names.append(name) if names: return " | ".join(names) else: for name in dir(cls): if (not name.startswith("_") and getattr(cls, name) == value): return name if default_format is None: raise ValueError("a name for value %d was not found in %s" % (value, cls.__name__)) else: return default_format % value for cls in CONSTANT_CLASSES: cls._is_bitfield = cls in BITFIELD_CONSTANT_CLASSES cls.to_string = classmethod(to_string) # {{{ get_info attributes ------------------------------------------------- def make_getinfo(info_method, info_name, info_attr): def result(self): return info_method(self, info_attr) return property(result) def make_cacheable_getinfo(info_method, info_name, cache_attr, info_attr): def result(self): try: return getattr(self, cache_attr) except AttributeError: pass result = info_method(self, info_attr) setattr(self, cache_attr, result) return result return property(result) for cls, (info_method, info_class, cacheable_attrs) \ in cls_to_info_cls.items(): for info_name, _info_value in info_class.__dict__.items(): if info_name == "to_string" or info_name.startswith("_"): continue info_lower = info_name.lower() info_constant = getattr(info_class, info_name) if info_name in cacheable_attrs: cache_attr = intern("_info_cache_"+info_lower) setattr(cls, info_lower, make_cacheable_getinfo( info_method, info_lower, cache_attr, info_constant)) else: setattr(cls, info_lower, make_getinfo( info_method, info_name, info_constant)) # }}} if _cl.have_gl(): def gl_object_get_gl_object(self): return self.get_gl_object_info()[1] GLBuffer.gl_object = property(gl_object_get_gl_object) 
GLTexture.gl_object = property(gl_object_get_gl_object) _add_functionality() # }}} # {{{ _OverriddenArrayInterfaceSVMAllocation if get_cl_header_version() >= (2, 0): class _OverriddenArrayInterfaceSVMAllocation(SVMAllocation): def __init__(self, ctx, size, alignment, flags, *, _interface, queue=None): """ :arg ctx: a :class:`Context` :arg flags: some of :class:`svm_mem_flags`. """ super().__init__(ctx, size, alignment, flags, queue) # mem_flags.READ_ONLY applies to kernels, not the host read_write = True _interface["data"] = (int(self.svm_ptr), not read_write) self.__array_interface__ = _interface # }}} # {{{ create_image def create_image(context, flags, format, shape=None, pitches=None, hostbuf=None, is_array=False, buffer=None) -> Image: """ See :class:`mem_flags` for values of *flags*. *shape* is a 2- or 3-tuple. *format* is an instance of :class:`ImageFormat`. *pitches* is a 1-tuple for 2D images and a 2-tuple for 3D images, indicating the distance in bytes from one scan line to the next, and from one 2D image slice to the next. If *hostbuf* is given and *shape* is *None*, then *hostbuf.shape* is used as the *shape* parameter. :class:`Image` inherits from :class:`MemoryObject`. .. note:: If you want to load images from :class:`numpy.ndarray` instances or read images back into them, be aware that OpenCL images expect the *x* dimension to vary fastest, whereas in the default (C) order of :mod:`numpy` arrays, the last index varies fastest. If your array is arranged in the wrong order in memory, there are two possible fixes for this: * Convert the array to Fortran (column-major) order using :func:`numpy.asarray`. * Pass *ary.T.copy()* to the image creation function. .. versionadded:: 2024.3 """ return Image(context, flags, format, shape=shape, pitches=pitches, hostbuf=hostbuf, is_array=is_array, buffer=buffer, _through_create_image=True) # }}} # {{{ create_some_context def choose_devices(interactive: bool | None = None, answers: list[str] | None = None) -> list[Device]: """ Choose :class:`Device` instances 'somehow'. :arg interactive: If multiple choices for platform and/or device exist, *interactive* is ``True`` (or ``None`` and ``sys.stdin.isatty()`` returns ``True``), then the user is queried about which device should be chosen. Otherwise, a device is chosen in an implementation-defined manner. :arg answers: A sequence of strings that will be used to answer the platform/device selection questions. :returns: a list of :class:`Device` instances. 
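    For example, a (hypothetical) non-interactive invocation that selects
    the first device on the first platform::

        import pyopencl as cl

        devices = cl.choose_devices(answers=["0", "0"])
        ctx = cl.Context(devices)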
""" if answers is None: if "PYOPENCL_CTX" in os.environ: ctx_spec = os.environ["PYOPENCL_CTX"] answers = ctx_spec.split(":") if "PYOPENCL_TEST" in os.environ: from pyopencl.tools import get_test_platforms_and_devices for _plat, devs in get_test_platforms_and_devices(): for dev in devs: return [dev] if answers is not None: pre_provided_answers = answers answers = answers[:] else: pre_provided_answers = None user_inputs = [] if interactive is None: interactive = True try: if not sys.stdin.isatty(): interactive = False except Exception: interactive = False def cc_print(s): if interactive: print(s) def get_input(prompt): if answers: return str(answers.pop(0)) elif not interactive: return "" else: user_input = input(prompt) user_inputs.append(user_input) return user_input # {{{ pick a platform platforms = get_platforms() if not platforms: raise Error("no platforms found") else: if not answers: cc_print("Choose platform:") for i, pf in enumerate(platforms): cc_print("[%d] %s" % (i, pf)) answer = get_input("Choice [0]:") if not answer: platform = platforms[0] else: platform = None try: int_choice = int(answer) except ValueError: pass else: if 0 <= int_choice < len(platforms): platform = platforms[int_choice] if platform is None: answer = answer.lower() for pf in platforms: if answer in pf.name.lower(): platform = pf if platform is None: raise RuntimeError("input did not match any platform") # }}} # {{{ pick a device devices = platform.get_devices() def parse_device(choice): try: int_choice = int(choice) except ValueError: pass else: if 0 <= int_choice < len(devices): return devices[int_choice] choice = choice.lower() for dev in devices: if choice in dev.name.lower(): return dev raise RuntimeError("input did not match any device") if not devices: raise Error("no devices found") elif len(devices) == 1 and not answers: cc_print(f"Choosing only available device: {devices[0]}") pass else: if not answers: cc_print("Choose device(s):") for i, dev in enumerate(devices): cc_print("[%d] %s" % (i, dev)) answer = get_input("Choice, comma-separated [0]:") if not answer: devices = [devices[0]] else: devices = [parse_device(i) for i in answer.split(",")] # }}} if user_inputs: if pre_provided_answers is not None: user_inputs = pre_provided_answers + user_inputs cc_print("Set the environment variable PYOPENCL_CTX='%s' to " "avoid being asked again." % ":".join(user_inputs)) if answers: raise RuntimeError("not all provided choices were used by " "choose_devices. (left over: '%s')" % ":".join(answers)) return devices def create_some_context(interactive: bool | None = None, answers: list[str] | None = None) -> Context: """ Create a :class:`Context` 'somehow'. :arg interactive: If multiple choices for platform and/or device exist, *interactive* is ``True`` (or ``None`` and ``sys.stdin.isatty()`` returns ``True``), then the user is queried about which device should be chosen. Otherwise, a device is chosen in an implementation-defined manner. :arg answers: A sequence of strings that will be used to answer the platform/device selection questions. :returns: an instance of :class:`Context`. """ devices = choose_devices(interactive, answers) return Context(devices) _csc = create_some_context # }}} # {{{ SVMMap class SVMMap: """ Returned by :func:`SVMPointer.map` and :func:`SVM.map`. This class may also be used as a context manager in a ``with`` statement. :meth:`release` will be called upon exit from the ``with`` region. The value returned to the ``as`` part of the context manager is the mapped Python object (e.g. 
a :mod:`numpy` array). .. versionadded:: 2016.2 .. property:: event The :class:`Event` returned when mapping the memory. .. automethod:: release """ def __init__(self, svm, array, queue, event): self.svm = svm self.array = array self.queue = queue self.event = event def __del__(self): if self.svm is not None: self.release() def __enter__(self): return self.array def __exit__(self, exc_type, exc_val, exc_tb): self.release() def release(self, queue=None, wait_for=None): """ :arg queue: a :class:`pyopencl.CommandQueue`. Defaults to the one with which the map was created, if not specified. :returns: a :class:`pyopencl.Event` |std-enqueue-blurb| """ evt = self.svm._enqueue_unmap(self.queue) self.svm = None return evt # }}} # {{{ enqueue_copy _IMAGE_MEM_OBJ_TYPES = [mem_object_type.IMAGE2D, mem_object_type.IMAGE3D] if get_cl_header_version() >= (1, 2): _IMAGE_MEM_OBJ_TYPES.append(mem_object_type.IMAGE2D_ARRAY) def enqueue_copy(queue, dest, src, **kwargs): """Copy from :class:`Image`, :class:`Buffer` or the host to :class:`Image`, :class:`Buffer` or the host. (Note: host-to-host copies are unsupported.) The following keyword arguments are available: :arg wait_for: (optional, default empty) :arg is_blocking: Wait for completion. Defaults to *True*. (Available on any copy involving host memory) :return: A :class:`NannyEvent` if the transfer involved a host-side buffer, otherwise an :class:`Event`. .. note:: Be aware that the deletion of the :class:`NannyEvent` that is returned by the function if the transfer involved a host-side buffer will block until the transfer is complete, so be sure to keep a reference to this :class:`Event` until the transfer has completed. .. note:: Two types of 'buffer' occur in the arguments to this function, :class:`Buffer` and 'host-side buffers'. The latter are defined by Python and commonly called `buffer objects `__. :mod:`numpy` arrays are a very common example. Make sure to always be clear on whether a :class:`Buffer` or a Python buffer object is needed. .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`Buffer` ↔ host .. ------------------------------------------------------------------------ :arg src_offset: offset in bytes (optional) May only be nonzero if applied on the device side. :arg dst_offset: offset in bytes (optional) May only be nonzero if applied on the device side. .. note:: The size of the transfer is controlled by the size of the of the host-side buffer. If the host-side buffer is a :class:`numpy.ndarray`, you can control the transfer size by transferring into a smaller 'view' of the target array, like this:: cl.enqueue_copy(queue, large_dest_numpy_array[:15], src_buffer) .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`Buffer` ↔ :class:`Buffer` .. ------------------------------------------------------------------------ :arg byte_count: (optional) If not specified, defaults to the size of the source in versions 2012.x and earlier, and to the minimum of the size of the source and target from 2013.1 on. :arg src_offset: (optional) :arg dst_offset: (optional) .. ------------------------------------------------------------------------ .. rubric :: Rectangular :class:`Buffer` ↔ host transfers (CL 1.1 and newer) .. ------------------------------------------------------------------------ :arg buffer_origin: :class:`tuple` of :class:`int` of length three or shorter. 
(mandatory) :arg host_origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg region: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg buffer_pitches: :class:`tuple` of :class:`int` of length two or shorter. (optional, "tightly-packed" if unspecified) :arg host_pitches: :class:`tuple` of :class:`int` of length two or shorter. (optional, "tightly-packed" if unspecified) .. ------------------------------------------------------------------------ .. rubric :: Rectangular :class:`Buffer` ↔ :class:`Buffer` transfers (CL 1.1 and newer) .. ------------------------------------------------------------------------ :arg src_origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg dst_origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg region: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg src_pitches: :class:`tuple` of :class:`int` of length two or shorter. (optional, "tightly-packed" if unspecified) :arg dst_pitches: :class:`tuple` of :class:`int` of length two or shorter. (optional, "tightly-packed" if unspecified) .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`Image` ↔ host .. ------------------------------------------------------------------------ :arg origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg region: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg pitches: :class:`tuple` of :class:`int` of length two or shorter. (optional) .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`Buffer` ↔ :class:`Image` .. ------------------------------------------------------------------------ :arg offset: offset in buffer (mandatory) :arg origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg region: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`Image` ↔ :class:`Image` .. ------------------------------------------------------------------------ :arg src_origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg dest_origin: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) :arg region: :class:`tuple` of :class:`int` of length three or shorter. (mandatory) .. ------------------------------------------------------------------------ .. rubric :: Transfer :class:`SVMPointer`/host ↔ :class:`SVMPointer`/host .. ------------------------------------------------------------------------ :arg byte_count: (optional) If not specified, defaults to the size of the source in versions 2012.x and earlier, and to the minimum of the size of the source and target from 2013.1 on. |std-enqueue-blurb| .. versionadded:: 2011.1 """ if isinstance(dest, MemoryObjectHolder): if dest.type == mem_object_type.BUFFER: if isinstance(src, MemoryObjectHolder): if src.type == mem_object_type.BUFFER: # {{{ buffer -> buffer if "src_origin" in kwargs: # rectangular return _cl._enqueue_copy_buffer_rect( queue, src, dest, **kwargs) else: # linear dest_offset = kwargs.pop("dest_offset", None) if dest_offset is not None: if "dst_offset" in kwargs: raise TypeError("may not specify both 'dst_offset' " "and 'dest_offset'") warn("The 'dest_offset' argument of enqueue_copy " "is deprecated. Use 'dst_offset' instead. 
" "'dest_offset' will stop working in 2023.x.", DeprecationWarning, stacklevel=2) kwargs["dst_offset"] = dest_offset return _cl._enqueue_copy_buffer(queue, src, dest, **kwargs) # }}} elif src.type in _IMAGE_MEM_OBJ_TYPES: return _cl._enqueue_copy_image_to_buffer( queue, src, dest, **kwargs) else: raise ValueError("invalid src mem object type") else: # {{{ host -> buffer if "buffer_origin" in kwargs: return _cl._enqueue_write_buffer_rect(queue, dest, src, **kwargs) else: device_offset = kwargs.pop("device_offset", None) if device_offset is not None: if "dst_offset" in kwargs: raise TypeError("may not specify both 'device_offset' " "and 'dst_offset'") warn("The 'device_offset' argument of enqueue_copy " "is deprecated. Use 'dst_offset' instead. " "'dst_offset' will stop working in 2023.x.", DeprecationWarning, stacklevel=2) kwargs["dst_offset"] = device_offset return _cl._enqueue_write_buffer(queue, dest, src, **kwargs) # }}} elif dest.type in _IMAGE_MEM_OBJ_TYPES: # {{{ ... -> image if isinstance(src, MemoryObjectHolder): if src.type == mem_object_type.BUFFER: return _cl._enqueue_copy_buffer_to_image( queue, src, dest, **kwargs) elif src.type in _IMAGE_MEM_OBJ_TYPES: return _cl._enqueue_copy_image(queue, src, dest, **kwargs) else: raise ValueError("invalid src mem object type") else: # assume from-host origin = kwargs.pop("origin") region = kwargs.pop("region") pitches = kwargs.pop("pitches", (0, 0)) if len(pitches) == 1: kwargs["row_pitch"], = pitches else: kwargs["row_pitch"], kwargs["slice_pitch"] = pitches return _cl._enqueue_write_image( queue, dest, origin, region, src, **kwargs) # }}} else: raise ValueError("invalid dest mem object type") elif get_cl_header_version() >= (2, 0) and isinstance(dest, SVMPointer): # {{{ ... -> SVM if not isinstance(src, SVMPointer): src = SVM(src) is_blocking = kwargs.pop("is_blocking", True) # These are NOT documented. They only support consistency with the # Buffer-based API for the sake of the Array. if kwargs.pop("src_offset", 0) != 0: raise ValueError("src_offset must be 0") if kwargs.pop("dst_offset", 0) != 0: raise ValueError("dst_offset must be 0") return _cl._enqueue_svm_memcpy(queue, is_blocking, dest, src, **kwargs) # }}} else: # assume to-host if isinstance(src, MemoryObjectHolder): if src.type == mem_object_type.BUFFER: if "buffer_origin" in kwargs: return _cl._enqueue_read_buffer_rect(queue, src, dest, **kwargs) else: device_offset = kwargs.pop("device_offset", None) if device_offset is not None: if "src_offset" in kwargs: raise TypeError("may not specify both 'device_offset' " "and 'src_offset'") warn("The 'device_offset' argument of enqueue_copy " "is deprecated. Use 'src_offset' instead. " "'dst_offset' will stop working in 2023.x.", DeprecationWarning, stacklevel=2) kwargs["src_offset"] = device_offset return _cl._enqueue_read_buffer(queue, src, dest, **kwargs) elif src.type in _IMAGE_MEM_OBJ_TYPES: origin = kwargs.pop("origin") region = kwargs.pop("region") pitches = kwargs.pop("pitches", (0, 0)) if len(pitches) == 1: kwargs["row_pitch"], = pitches else: kwargs["row_pitch"], kwargs["slice_pitch"] = pitches return _cl._enqueue_read_image( queue, src, origin, region, dest, **kwargs) else: raise ValueError("invalid src mem object type") elif isinstance(src, SVMPointer): # {{{ svm -> host # dest is not a SVM instance, otherwise we'd be in the branch above # This is NOT documented. They only support consistency with the # Buffer-based API for the sake of the Array. 
if kwargs.pop("src_offset", 0) != 0: raise ValueError("src_offset must be 0") is_blocking = kwargs.pop("is_blocking", True) return _cl._enqueue_svm_memcpy( queue, is_blocking, SVM(dest), src, **kwargs) # }}} else: # assume from-host raise TypeError("enqueue_copy cannot perform host-to-host transfers") # }}} # {{{ enqueue_fill def enqueue_fill(queue: CommandQueue, dest: MemoryObject | SVMPointer, pattern: Any, size: int, *, offset: int = 0, wait_for: Sequence[Event] | None = None) -> Event: """ .. versionadded:: 2022.2 """ if isinstance(dest, MemoryObjectHolder): return enqueue_fill_buffer(queue, dest, pattern, offset, size, wait_for) elif isinstance(dest, SVMPointer): if offset: raise NotImplementedError("enqueue_fill with SVM does not yet support " "offsets") return enqueue_svm_memfill(queue, dest, pattern, size, wait_for) else: raise TypeError(f"enqueue_fill does not know how to fill '{type(dest)}'") # }}} # {{{ image creation DTYPE_TO_CHANNEL_TYPE = { np.dtype(np.float32): channel_type.FLOAT, np.dtype(np.int16): channel_type.SIGNED_INT16, np.dtype(np.int32): channel_type.SIGNED_INT32, np.dtype(np.int8): channel_type.SIGNED_INT8, np.dtype(np.uint16): channel_type.UNSIGNED_INT16, np.dtype(np.uint32): channel_type.UNSIGNED_INT32, np.dtype(np.uint8): channel_type.UNSIGNED_INT8, } try: np.float16 # noqa: B018 except Exception: pass else: DTYPE_TO_CHANNEL_TYPE[np.dtype(np.float16)] = channel_type.HALF_FLOAT DTYPE_TO_CHANNEL_TYPE_NORM = { np.dtype(np.int16): channel_type.SNORM_INT16, np.dtype(np.int8): channel_type.SNORM_INT8, np.dtype(np.uint16): channel_type.UNORM_INT16, np.dtype(np.uint8): channel_type.UNORM_INT8, } def image_from_array(ctx, ary, num_channels=None, mode="r", norm_int=False): if not ary.flags.c_contiguous: raise ValueError("array must be C-contiguous") dtype = ary.dtype if num_channels is None: try: dtype, num_channels = \ pyopencl.cltypes.vec_type_to_scalar_and_count[dtype] except KeyError: # It must be a scalar type then. 
num_channels = 1 shape = ary.shape strides = ary.strides elif num_channels == 1: shape = ary.shape strides = ary.strides else: if ary.shape[-1] != num_channels: raise RuntimeError("last dimension must be equal to number of channels") shape = ary.shape[:-1] strides = ary.strides[:-1] if mode == "r": mode_flags = mem_flags.READ_ONLY elif mode == "w": mode_flags = mem_flags.WRITE_ONLY else: raise ValueError("invalid value '%s' for 'mode'" % mode) img_format = { 1: channel_order.R, 2: channel_order.RG, 3: channel_order.RGB, 4: channel_order.RGBA, }[num_channels] assert ary.strides[-1] == ary.dtype.itemsize if norm_int: channel_type = DTYPE_TO_CHANNEL_TYPE_NORM[dtype] else: channel_type = DTYPE_TO_CHANNEL_TYPE[dtype] return create_image(ctx, mode_flags | mem_flags.COPY_HOST_PTR, ImageFormat(img_format, channel_type), shape=shape[::-1], pitches=strides[::-1][1:], hostbuf=ary) # }}} # {{{ enqueue_* compatibility shims def enqueue_marker(queue, wait_for=None): if queue._get_cl_version() >= (1, 2) and get_cl_header_version() >= (1, 2): return _cl._enqueue_marker_with_wait_list(queue, wait_for) else: if wait_for: _cl._enqueue_wait_for_events(queue, wait_for) return _cl._enqueue_marker(queue) def enqueue_barrier(queue, wait_for=None): if queue._get_cl_version() >= (1, 2) and get_cl_header_version() >= (1, 2): return _cl._enqueue_barrier_with_wait_list(queue, wait_for) else: _cl._enqueue_barrier(queue) if wait_for: _cl._enqueue_wait_for_events(queue, wait_for) return _cl._enqueue_marker(queue) def enqueue_fill_buffer(queue, mem, pattern, offset, size, wait_for=None): if not (queue._get_cl_version() >= (1, 2) and get_cl_header_version() >= (1, 2)): warn( "The context for this queue does not declare OpenCL 1.2 support, so " "the next thing you might see is a crash", stacklevel=2) if _PYPY and isinstance(pattern, np.generic): pattern = np.asarray(pattern) return _cl._enqueue_fill_buffer(queue, mem, pattern, offset, size, wait_for) # }}} # {{{ numpy-like svm allocation def enqueue_svm_memfill(queue, dest, pattern, byte_count=None, wait_for=None): """Fill shared virtual memory with a pattern. :arg dest: a Python buffer object, or any implementation of :class:`SVMPointer`. :arg pattern: a Python buffer object (e.g. a :class:`numpy.ndarray` with the fill pattern to be used. :arg byte_count: The size of the memory to be fill. Defaults to the entirety of *dest*. |std-enqueue-blurb| .. versionadded:: 2016.2 """ if not isinstance(dest, SVMPointer): dest = SVM(dest) return _cl._enqueue_svm_memfill( queue, dest, pattern, byte_count=byte_count, wait_for=wait_for) def enqueue_svm_migratemem(queue, svms, flags, wait_for=None): """ :arg svms: a collection of Python buffer objects (e.g. :mod:`numpy` arrays), or any implementation of :class:`SVMPointer`. :arg flags: a combination of :class:`mem_migration_flags` |std-enqueue-blurb| .. versionadded:: 2016.2 This function requires OpenCL 2.1. """ return _cl._enqueue_svm_migratemem(queue, svms, flags, wait_for) def svm_empty(ctx, flags, shape, dtype, order="C", alignment=None, queue=None): """Allocate an empty :class:`numpy.ndarray` of the given *shape*, *dtype* and *order*. (See :func:`numpy.empty` for the meaning of these arguments.) The array will be allocated in shared virtual memory belonging to *ctx*. :arg ctx: a :class:`Context` :arg flags: a combination of flags from :class:`svm_mem_flags`. :arg alignment: the number of bytes to which the beginning of the memory is aligned. Defaults to the :attr:`numpy.dtype.itemsize` of *dtype*. 
:returns: a :class:`numpy.ndarray` whose :attr:`numpy.ndarray.base` attribute is a :class:`SVMAllocation`. To pass the resulting array to an OpenCL kernel or :func:`enqueue_copy`, you will likely want to wrap the returned array in an :class:`SVM` tag. .. versionadded:: 2016.2 .. versionchanged:: 2022.2 *queue* argument added. """ dtype = np.dtype(dtype) try: s = 1 for dim in shape: s *= dim except TypeError as err: admissible_types = (int, np.integer) if not isinstance(shape, admissible_types): raise TypeError("shape must either be iterable or " "castable to an integer") from err s = shape shape = (shape,) itemsize = dtype.itemsize nbytes = s * itemsize from pyopencl.compyte.array import c_contiguous_strides, f_contiguous_strides if order in "fF": strides = f_contiguous_strides(itemsize, shape) elif order in "cC": strides = c_contiguous_strides(itemsize, shape) else: raise ValueError("order not recognized: %s" % order) descr = dtype.descr interface = { "version": 3, "shape": shape, "strides": strides, } if len(descr) == 1: interface["typestr"] = descr[0][1] else: interface["typestr"] = "V%d" % itemsize interface["descr"] = descr if alignment is None: alignment = itemsize svm_alloc = _OverriddenArrayInterfaceSVMAllocation( ctx, nbytes, alignment, flags, _interface=interface, queue=queue) return np.asarray(svm_alloc) def svm_empty_like(ctx, flags, ary, alignment=None): """Allocate an empty :class:`numpy.ndarray` like the existing :class:`numpy.ndarray` *ary*. The array will be allocated in shared virtual memory belonging to *ctx*. :arg ctx: a :class:`Context` :arg flags: a combination of flags from :class:`svm_mem_flags`. :arg alignment: the number of bytes to which the beginning of the memory is aligned. Defaults to the :attr:`numpy.dtype.itemsize` of *dtype*. :returns: a :class:`numpy.ndarray` whose :attr:`numpy.ndarray.base` attribute is a :class:`SVMAllocation`. To pass the resulting array to an OpenCL kernel or :func:`enqueue_copy`, you will likely want to wrap the returned array in an :class:`SVM` tag. .. versionadded:: 2016.2 """ if ary.flags.c_contiguous: order = "C" elif ary.flags.f_contiguous: order = "F" else: raise ValueError("array is neither C- nor Fortran-contiguous") return svm_empty(ctx, flags, ary.shape, ary.dtype, order, alignment=alignment) def csvm_empty(ctx, shape, dtype, order="C", alignment=None): """ Like :func:`svm_empty`, but with *flags* set for a coarse-grain read-write buffer. .. versionadded:: 2016.2 """ return svm_empty(ctx, svm_mem_flags.READ_WRITE, shape, dtype, order, alignment) def csvm_empty_like(ctx, ary, alignment=None): """ Like :func:`svm_empty_like`, but with *flags* set for a coarse-grain read-write buffer. .. versionadded:: 2016.2 """ return svm_empty_like(ctx, svm_mem_flags.READ_WRITE, ary) def fsvm_empty(ctx, shape, dtype, order="C", alignment=None): """ Like :func:`svm_empty`, but with *flags* set for a fine-grain read-write buffer. .. versionadded:: 2016.2 """ return svm_empty(ctx, svm_mem_flags.READ_WRITE | svm_mem_flags.SVM_FINE_GRAIN_BUFFER, shape, dtype, order, alignment) def fsvm_empty_like(ctx, ary, alignment=None): """ Like :func:`svm_empty_like`, but with *flags* set for a fine-grain read-write buffer. .. versionadded:: 2016.2 """ return svm_empty_like( ctx, svm_mem_flags.READ_WRITE | svm_mem_flags.SVM_FINE_GRAIN_BUFFER, ary) # }}} _KERNEL_ARG_CLASSES: tuple[type, ...] 
= ( MemoryObjectHolder, Sampler, CommandQueue, LocalMemory, ) if get_cl_header_version() >= (2, 0): _KERNEL_ARG_CLASSES = (*_KERNEL_ARG_CLASSES, SVM) # vim: foldmethod=marker pyopencl-2025.1/pyopencl/_cluda.py0000644000000000000000000000403214332717401014015 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ CLUDA_PREAMBLE = """ #define local_barrier() barrier(CLK_LOCAL_MEM_FENCE); #define WITHIN_KERNEL /* empty */ #define KERNEL __kernel #define GLOBAL_MEM __global #define LOCAL_MEM __local #define LOCAL_MEM_ARG __local #define REQD_WG_SIZE(X,Y,Z) __attribute__((reqd_work_group_size(X, Y, Z))) #define LID_0 ((ptrdiff_t) get_local_id(0)) #define LID_1 ((ptrdiff_t) get_local_id(1)) #define LID_2 ((ptrdiff_t) get_local_id(2)) #define GID_0 ((ptrdiff_t) get_group_id(0)) #define GID_1 ((ptrdiff_t) get_group_id(1)) #define GID_2 ((ptrdiff_t) get_group_id(2)) #define LDIM_0 ((ptrdiff_t) get_local_size(0)) #define LDIM_1 ((ptrdiff_t) get_local_size(1)) #define LDIM_2 ((ptrdiff_t) get_local_size(2)) #define GDIM_0 ((ptrdiff_t) get_num_groups(0)) #define GDIM_1 ((ptrdiff_t) get_num_groups(1)) #define GDIM_2 ((ptrdiff_t) get_num_groups(2)) % if double_support: #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif % endif """ pyopencl-2025.1/pyopencl/_mymako.py0000644000000000000000000000115414332717401014224 0ustar00try: import mako.template # noqa: F401 except ImportError as err: raise ImportError( "Some of PyOpenCL's facilities require the Mako templating engine.\n" "You or a piece of software you have used has tried to call such a\n" "part of PyOpenCL, but there was a problem importing Mako.\n\n" "You may install mako now by typing one of:\n" "- easy_install Mako\n" "- pip install Mako\n" "- aptitude install python-mako\n" "\nor whatever else is appropriate for your system.") from err from mako import * # noqa: F403 pyopencl-2025.1/pyopencl/algorithm.py0000644000000000000000000014413514332717401014565 0ustar00"""Algorithms built on scans.""" __copyright__ = """ Copyright 2011-2012 Andreas Kloeckner Copyright 2017 Hao Gao """ __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 
subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ from dataclasses import dataclass from typing import Optional import numpy as np from mako.template import Template from pytools import memoize, memoize_method import pyopencl as cl import pyopencl.array from pyopencl.elementwise import ElementwiseKernel from pyopencl.scan import GenericScanKernel, ScanTemplate from pyopencl.tools import dtype_to_ctype, get_arg_offset_adjuster_code # {{{ "extra args" handling utility def _extract_extra_args_types_values(extra_args): if extra_args is None: extra_args = [] from pyopencl.tools import ScalarArg, VectorArg extra_args_types = [] extra_args_values = [] extra_wait_for = [] for name, val in extra_args: if isinstance(val, cl.array.Array): extra_args_types.append(VectorArg(val.dtype, name, with_offset=False)) extra_args_values.append(val) extra_wait_for.extend(val.events) elif isinstance(val, np.generic): extra_args_types.append(ScalarArg(val.dtype, name)) extra_args_values.append(val) else: raise RuntimeError("argument '%s' not understood" % name) return tuple(extra_args_types), extra_args_values, extra_wait_for # }}} # {{{ copy_if _copy_if_template = ScanTemplate( arguments="item_t *ary, item_t *out, scan_t *count", input_expr="(%(predicate)s) ? 1 : 0", scan_expr="a+b", neutral="0", output_statement=""" if (prev_item != item) out[item-1] = ary[i]; if (i+1 == N) *count = item; """, template_processor="printf") def copy_if(ary, predicate, extra_args=None, preamble="", queue=None, wait_for=None): """Copy the elements of *ary* satisfying *predicate* to an output array. :arg predicate: a C expression evaluating to a ``bool``, represented as a string. The value to test is available as ``ary[i]``, and if the expression evaluates to ``true``, then this value ends up in the output. :arg extra_args: |scan_extra_args| :arg preamble: |preamble| :arg wait_for: |explain-waitfor| :returns: a tuple *(out, count, event)* where *out* is the output array, *count* is an on-device scalar (fetch to host with ``count.get()``) indicating how many elements satisfied *predicate*, and *event* is a :class:`pyopencl.Event` for dependency management. *out* is allocated to the same length as *ary*, but only the first *count* entries carry meaning. ..
versionadded:: 2013.1 """ if len(ary) > np.iinfo(np.int32).max: scan_dtype = np.int64 else: scan_dtype = np.int32 if wait_for is None: wait_for = [] extra_args_types, extra_args_values, extra_wait_for = \ _extract_extra_args_types_values(extra_args) wait_for = wait_for + extra_wait_for knl = _copy_if_template.build(ary.context, type_aliases=(("scan_t", scan_dtype), ("item_t", ary.dtype)), var_values=(("predicate", predicate),), more_preamble=preamble, more_arguments=extra_args_types) out = cl.array.empty_like(ary) count = ary._new_with_changes(data=None, offset=0, shape=(), strides=(), dtype=scan_dtype) evt = knl(ary, out, count, *extra_args_values, queue=queue, wait_for=wait_for) return out, count, evt # }}} # {{{ remove_if def remove_if(ary, predicate, extra_args=None, preamble="", queue=None, wait_for=None): """Copy the elements of *ary* not satisfying *predicate* to an output array. :arg predicate: a C expression evaluating to a ``bool``, represented as a string. The value to test is available as ``ary[i]``, and if the expression evaluates to ``false``, then this value ends up in the output. :arg extra_args: |scan_extra_args| :arg preamble: |preamble| :arg wait_for: |explain-waitfor| :returns: a tuple *(out, count, event)* where *out* is the output array, *count* is an on-device scalar (fetch to host with ``count.get()``) indicating how many elements did not satisfy *predicate*, and *event* is a :class:`pyopencl.Event` for dependency management. .. versionadded:: 2013.1 """ return copy_if(ary, "!(%s)" % predicate, extra_args=extra_args, preamble=preamble, queue=queue, wait_for=wait_for) # }}} # {{{ partition _partition_template = ScanTemplate( arguments=( "item_t *ary, item_t *out_true, item_t *out_false, " "scan_t *count_true"), input_expr="(%(predicate)s) ? 1 : 0", scan_expr="a+b", neutral="0", output_statement="""//CL// if (prev_item != item) out_true[item-1] = ary[i]; else out_false[i-item] = ary[i]; if (i+1 == N) *count_true = item; """, template_processor="printf") def partition(ary, predicate, extra_args=None, preamble="", queue=None, wait_for=None): """Copy the elements of *ary* into one of two arrays depending on whether they satisfy *predicate*. :arg predicate: a C expression evaluating to a ``bool``, represented as a string. The value to test is available as ``ary[i]``. :arg extra_args: |scan_extra_args| :arg preamble: |preamble| :arg wait_for: |explain-waitfor| :returns: a tuple *(out_true, out_false, count, event)* where *count* is an on-device scalar (fetch to host with ``count.get()``) indicating how many elements satisfied the predicate, and *event* is a :class:`pyopencl.Event` for dependency management. .. 
versionadded:: 2013.1 """ if len(ary) > np.iinfo(np.uint32).max: scan_dtype = np.uint64 else: scan_dtype = np.uint32 if wait_for is None: wait_for = [] extra_args_types, extra_args_values, extra_wait_for = \ _extract_extra_args_types_values(extra_args) wait_for = wait_for + extra_wait_for knl = _partition_template.build( ary.context, type_aliases=(("item_t", ary.dtype), ("scan_t", scan_dtype)), var_values=(("predicate", predicate),), more_preamble=preamble, more_arguments=extra_args_types) out_true = cl.array.empty_like(ary) out_false = cl.array.empty_like(ary) count = ary._new_with_changes(data=None, offset=0, shape=(), strides=(), dtype=scan_dtype) evt = knl(ary, out_true, out_false, count, *extra_args_values, queue=queue, wait_for=wait_for) return out_true, out_false, count, evt # }}} # {{{ unique _unique_template = ScanTemplate( arguments="item_t *ary, item_t *out, scan_t *count_unique", input_fetch_exprs=[ ("ary_im1", "ary", -1), ("ary_i", "ary", 0), ], input_expr="(i == 0) || (IS_EQUAL_EXPR(ary_im1, ary_i) ? 0 : 1)", scan_expr="a+b", neutral="0", output_statement=""" if (prev_item != item) out[item-1] = ary[i]; if (i+1 == N) *count_unique = item; """, preamble="#define IS_EQUAL_EXPR(a, b) %(macro_is_equal_expr)s\n", template_processor="printf") def unique(ary, is_equal_expr="a == b", extra_args=None, preamble="", queue=None, wait_for=None): """Copy the elements of *ary* into the output if *is_equal_expr*, applied to the array element and its predecessor, yields false. Works like the UNIX command :program:`uniq`, with a potentially custom comparison. This operation is often used on sorted sequences. :arg is_equal_expr: a C expression evaluating to a ``bool``, represented as a string. The elements being compared are available as ``a`` and ``b``. If this expression yields ``false``, the two are considered distinct. :arg extra_args: |scan_extra_args| :arg preamble: |preamble| :arg wait_for: |explain-waitfor| :returns: a tuple *(out, count, event)* where *out* is the output array, *count* is an on-device scalar (fetch to host with ``count.get()``) indicating how many elements satisfied the predicate, and *event* is a :class:`pyopencl.Event` for dependency management. .. 
versionadded:: 2013.1 """ if len(ary) > np.iinfo(np.uint32).max: scan_dtype = np.uint64 else: scan_dtype = np.uint32 if wait_for is None: wait_for = [] extra_args_types, extra_args_values, extra_wait_for = \ _extract_extra_args_types_values(extra_args) wait_for = wait_for + extra_wait_for knl = _unique_template.build( ary.context, type_aliases=(("item_t", ary.dtype), ("scan_t", scan_dtype)), var_values=(("macro_is_equal_expr", is_equal_expr),), more_preamble=preamble, more_arguments=extra_args_types) out = cl.array.empty_like(ary) count = ary._new_with_changes(data=None, offset=0, shape=(), strides=(), dtype=scan_dtype) evt = knl(ary, out, count, *extra_args_values, queue=queue, wait_for=wait_for) return out, count, evt # }}} # {{{ radix_sort def to_bin(n): # Py 2.5 has no built-in bin() digs = [] while n: digs.append(str(n % 2)) n >>= 1 return "".join(digs[::-1]) def _padded_bin(i, nbits): s = to_bin(i) while len(s) < nbits: s = "0" + s return s @memoize def _make_sort_scan_type(device, bits, index_dtype): name = "pyopencl_sort_scan_%s_%dbits_t" % ( index_dtype.type.__name__, bits) fields = [] for mnr in range(2**bits): fields.append(("c%s" % _padded_bin(mnr, bits), index_dtype)) dtype = np.dtype(fields) from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct dtype, c_decl = match_dtype_to_c_struct(device, name, dtype) dtype = get_or_register_dtype(name, dtype) return name, dtype, c_decl # {{{ types, helpers preamble RADIX_SORT_PREAMBLE_TPL = Template(r"""//CL// typedef ${scan_ctype} scan_t; typedef ${key_ctype} key_t; typedef ${index_ctype} index_t; // #define DEBUG #ifdef DEBUG #define dbg_printf(ARGS) printf ARGS #else #define dbg_printf(ARGS) /* */ #endif index_t get_count(scan_t s, int mnr) { return ${get_count_branch("")}; } #define BIN_NR(key_arg) ((key_arg >> base_bit) & ${2**bits - 1}) """, strict_undefined=True) # }}} # {{{ scan helpers RADIX_SORT_SCAN_PREAMBLE_TPL = Template(r"""//CL// scan_t scan_t_neutral() { scan_t result; %for mnr in range(2**bits): result.c${padded_bin(mnr, bits)} = 0; %endfor return result; } // considers bits (base_bit+bits-1, ..., base_bit) scan_t scan_t_from_value( key_t key, int base_bit, int i ) { // extract relevant bit range key_t bin_nr = BIN_NR(key); dbg_printf(("i: %d key:%d bin_nr:%d\n", i, key, bin_nr)); scan_t result; %for mnr in range(2**bits): result.c${padded_bin(mnr, bits)} = (bin_nr == ${mnr}); %endfor return result; } scan_t scan_t_add(scan_t a, scan_t b, bool across_seg_boundary) { %for mnr in range(2**bits): <% field = "c"+padded_bin(mnr, bits) %> b.${field} = a.${field} + b.${field}; %endfor return b; } """, strict_undefined=True) RADIX_SORT_OUTPUT_STMT_TPL = Template(r"""//CL// { key_t key = ${key_expr}; key_t my_bin_nr = BIN_NR(key); index_t previous_bins_size = 0; %for mnr in range(2**bits): previous_bins_size += (my_bin_nr > ${mnr}) ? last_item.c${padded_bin(mnr, bits)} : 0; %endfor index_t tgt_idx = previous_bins_size + get_count(item, my_bin_nr) - 1; %for arg_name in sort_arg_names: sorted_${arg_name}[tgt_idx] = ${arg_name}[i]; %endfor } """, strict_undefined=True) # }}} # {{{ driver class RadixSort: """Provides a general `radix sort `__ on the compute device. .. seealso:: :class:`pyopencl.bitonic_sort.BitonicSort` .. versionadded:: 2013.1 """ def __init__(self, context, arguments, key_expr, sort_arg_names, bits_at_a_time=2, index_dtype=np.int32, key_dtype=np.uint32, scan_kernel=GenericScanKernel, options=None): """ :arg arguments: A string of comma-separated C argument declarations. 
All types used here must be known to PyOpenCL (see :func:`pyopencl.tools.get_or_register_dtype`). :arg key_expr: An integer-valued C expression returning the key based on which the sort is performed. The array index for which the key is to be computed is available as ``i``. The expression may refer to any of the *arguments*. :arg sort_arg_names: A list of argument names whose corresponding array arguments will be sorted according to *key_expr*. """ # {{{ arg processing from pyopencl.tools import parse_arg_list self.arguments = parse_arg_list(arguments) del arguments self.sort_arg_names = sort_arg_names self.bits = int(bits_at_a_time) self.index_dtype = np.dtype(index_dtype) self.key_dtype = np.dtype(key_dtype) self.options = options # }}} # {{{ kernel creation scan_ctype, scan_dtype, scan_t_cdecl = \ _make_sort_scan_type(context.devices[0], self.bits, self.index_dtype) from pyopencl.tools import ScalarArg, VectorArg scan_arguments = ( list(self.arguments) + [VectorArg(arg.dtype, "sorted_"+arg.name) for arg in self.arguments if arg.name in sort_arg_names] + [ScalarArg(np.int32, "base_bit")]) def get_count_branch(known_bits): if len(known_bits) == self.bits: return "s.c%s" % known_bits boundary_mnr = known_bits + "1" + (self.bits-len(known_bits)-1)*"0" return ("((mnr < {}) ? {} : {})".format( int(boundary_mnr, 2), get_count_branch(known_bits+"0"), get_count_branch(known_bits+"1"))) codegen_args = { "bits": self.bits, "key_ctype": dtype_to_ctype(self.key_dtype), "key_expr": key_expr, "index_ctype": dtype_to_ctype(self.index_dtype), "index_type_max": np.iinfo(self.index_dtype).max, "padded_bin": _padded_bin, "scan_ctype": scan_ctype, "sort_arg_names": sort_arg_names, "get_count_branch": get_count_branch, } preamble = scan_t_cdecl+RADIX_SORT_PREAMBLE_TPL.render(**codegen_args) scan_preamble = preamble \ + RADIX_SORT_SCAN_PREAMBLE_TPL.render(**codegen_args) self.scan_kernel = scan_kernel( context, scan_dtype, arguments=scan_arguments, input_expr="scan_t_from_value(%s, base_bit, i)" % key_expr, scan_expr="scan_t_add(a, b, across_seg_boundary)", neutral="scan_t_neutral()", output_statement=RADIX_SORT_OUTPUT_STMT_TPL.render(**codegen_args), preamble=scan_preamble, options=self.options) for i, arg in enumerate(self.arguments): if isinstance(arg, VectorArg): self.first_array_arg_idx = i # }}} def __call__(self, *args, **kwargs): """Run the radix sort. In addition to *args* which must match the *arguments* specification on the constructor, the following keyword arguments are supported: :arg key_bits: specify how many bits (starting from least-significant) there are in the key. :arg allocator: See the *allocator* argument of :func:`pyopencl.array.empty`. :arg queue: A :class:`pyopencl.CommandQueue`, defaulting to the one from the first argument array. :arg wait_for: |explain-waitfor| :returns: A tuple ``(sorted, event)``. *sorted* consists of sorted copies of the arrays named in *sort_arg_names*, in the order of that list. *event* is a :class:`pyopencl.Event` for dependency management.
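A short usage sketch (an illustration, not upstream documentation; it follows the single-array pattern from PyOpenCL's test suite)::

    import numpy as np
    import pyopencl as cl
    import pyopencl.array
    from pyopencl.algorithm import RadixSort

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    sorter = RadixSort(ctx, "int *ary", key_expr="ary[i]",
                       sort_arg_names=["ary"])

    a_host = np.random.randint(0, 2**16, 5000).astype(np.int32)
    a_dev = cl.array.to_device(queue, a_host)

    # only the low 16 bits carry key information here
    (a_sorted,), evt = sorter(a_dev, key_bits=16, queue=queue)
    assert (a_sorted.get() == np.sort(a_host)).all()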
""" wait_for = kwargs.pop("wait_for", None) # {{{ run control key_bits = kwargs.pop("key_bits", None) if key_bits is None: key_bits = int(np.iinfo(self.key_dtype).bits) n = len(args[self.first_array_arg_idx]) allocator = kwargs.pop("allocator", None) if allocator is None: allocator = args[self.first_array_arg_idx].allocator queue = kwargs.pop("queue", None) if queue is None: queue = args[self.first_array_arg_idx].queue args = list(args) base_bit = 0 while base_bit < key_bits: sorted_args = [ cl.array.empty(queue, n, arg_descr.dtype, allocator=allocator) for arg_descr in self.arguments if arg_descr.name in self.sort_arg_names] scan_args = args + sorted_args + [base_bit] last_evt = self.scan_kernel(*scan_args, queue=queue, wait_for=wait_for) wait_for = [last_evt] # substitute sorted for i, arg_descr in enumerate(self.arguments): if arg_descr.name in self.sort_arg_names: args[i] = sorted_args[self.sort_arg_names.index(arg_descr.name)] base_bit += self.bits return [arg_val for arg_descr, arg_val in zip(self.arguments, args) if arg_descr.name in self.sort_arg_names], last_evt # }}} # }}} # }}} # {{{ generic parallel list builder # {{{ kernel template _LIST_BUILDER_TEMPLATE = Template("""//CL// % if double_support: #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE % endif #include ${preamble} // {{{ declare helper macros for user interface typedef ${index_type} index_type; %if is_count_stage: #define PLB_COUNT_STAGE %for name, dtype in list_names_and_dtypes: %if name in count_sharing: #define APPEND_${name}(value) { /* nothing */ } %else: #define APPEND_${name}(value) { ++(*plb_loc_${name}_count); } %endif %endfor %else: #define PLB_WRITE_STAGE %for name, dtype in list_names_and_dtypes: %if name in count_sharing: #define APPEND_${name}(value) \ { plb_${name}_list[(*plb_${count_sharing[name]}_index) - 1] \ = value; } %else: #define APPEND_${name}(value) \ { plb_${name}_list[(*plb_${name}_index)++] = value; } %endif %endfor %endif #define LIST_ARG_DECL ${user_list_arg_decl} #define LIST_ARGS ${user_list_args} #define USER_ARG_DECL ${user_arg_decl_no_offset} #define USER_ARGS ${user_args_no_offset} // }}} ${generate_template} // {{{ kernel entry point __kernel %if do_not_vectorize: __attribute__((reqd_work_group_size(1, 1, 1))) %endif void ${kernel_name}( ${kernel_list_arg_decl} ${user_arg_decl_with_offset} index_type n) { %if not do_not_vectorize: int lid = get_local_id(0); index_type gsize = get_global_size(0); index_type work_group_start = get_local_size(0)*get_group_id(0); for (index_type i = work_group_start + lid; i < n; i += gsize) %else: const int chunk_size = 128; index_type chunk_base = get_global_id(0)*chunk_size; index_type gsize = get_global_size(0); for (; chunk_base < n; chunk_base += gsize*chunk_size) for (index_type i = chunk_base; i < min(n, chunk_base+chunk_size); ++i) %endif { %if is_count_stage: %for name, dtype in list_names_and_dtypes: %if name not in count_sharing: index_type plb_loc_${name}_count = 0; %endif %endfor %else: %for name, dtype in list_names_and_dtypes: %if name not in count_sharing: index_type plb_${name}_index; if (plb_${name}_start_index) %if name in eliminate_empty_output_lists: plb_${name}_index = plb_${name}_start_index[ ${name}_compressed_indices[i] ]; %else: plb_${name}_index = plb_${name}_start_index[i]; %endif else plb_${name}_index = 0; %endif %endfor %endif ${arg_offset_adjustment} generate(${kernel_list_arg_values} USER_ARGS i); %if is_count_stage: %for name, dtype in 
list_names_and_dtypes: %if name not in count_sharing: if (plb_${name}_count) plb_${name}_count[i] = plb_loc_${name}_count; %endif %endfor %endif } } // }}} """, strict_undefined=True) # }}} def _get_arg_decl(arg_list): result = "" for arg in arg_list: result += arg.declarator() + ", " return result def _get_arg_list(arg_list, prefix=""): result = "" for arg in arg_list: result += prefix + arg.name + ", " return result @dataclass class BuiltList: count: Optional[int] starts: Optional[pyopencl.array.Array] lists: Optional[pyopencl.array.Array] = None num_nonempty_lists: Optional[int] = None nonempty_indices: Optional[pyopencl.array.Array] = None compressed_indices: Optional[pyopencl.array.Array] = None class ListOfListsBuilder: """Generates and executes code to produce a large number of variable-size lists, simply. .. note:: This functionality is provided as a preview. Its interface is subject to change until this notice is removed. .. versionadded:: 2013.1 Here's a usage example:: from pyopencl.algorithm import ListOfListsBuilder builder = ListOfListsBuilder(context, [("mylist", np.int32)], \"\"\" void generate(LIST_ARG_DECL USER_ARG_DECL index_type i) { int count = i % 4; for (int j = 0; j < count; ++j) { APPEND_mylist(count); } } \"\"\", arg_decls=[]) result, event = builder(queue, 2000) inf = result["mylist"] assert inf.count == 3000 assert (inf.lists.get()[-6:] == [1, 2, 2, 3, 3, 3]).all() The function ``generate`` above is called once for each "input object". Each input object can then generate zero or more list entries. The number of these input objects is given to :meth:`__call__` as *n_objects*. List entries are generated by calls to ``APPEND_name(value)``. Multiple lists may be generated at once. .. automethod:: __init__ .. automethod:: __call__ """ def __init__(self, context, list_names_and_dtypes, generate_template, arg_decls, count_sharing=None, devices=None, name_prefix="plb_build_list", options=None, preamble="", debug=False, complex_kernel=False, eliminate_empty_output_lists=False): """ :arg context: A :class:`pyopencl.Context`. :arg list_names_and_dtypes: a list of ``(name, dtype)`` tuples indicating the lists to be built. :arg generate_template: a snippet of C as described below :arg arg_decls: A string of comma-separated C argument declarations. :arg count_sharing: A mapping consisting of ``(child, mother)`` indicating that ``mother`` and ``child`` will always have the same number of indices, and the ``APPEND`` to ``mother`` will always happen *before* the ``APPEND`` to the child. :arg name_prefix: the name prefix to use for the compiled kernels :arg options: OpenCL compilation options for kernels using *generate_template*. :arg complex_kernel: If *True*, prevents vectorization on CPUs. :arg eliminate_empty_output_lists: A Python list of list names for which the empty output lists are eliminated. *generate_template* may use the following C macros/identifiers: * ``index_type``: expands to C identifier for the index type used for the calculation * ``USER_ARG_DECL``: expands to the C declarator for ``arg_decls`` * ``USER_ARGS``: a list of C argument values corresponding to ``user_arg_decl`` * ``LIST_ARG_DECL``: expands to a C argument list representing the data for the output lists. These are prefixed with ``"plb_"`` so as to not interfere with user-provided names. * ``LIST_ARGS``: a list of C argument values corresponding to ``LIST_ARG_DECL`` * ``APPEND_name(entry)``: inserts ``entry`` into the list ``name``. *entry* must be a valid C expression of the correct type.
All argument-list related macros have a trailing comma included if they are non-empty. *generate_template* must supply a function: .. code-block:: c void generate(USER_ARG_DECL LIST_ARG_DECL index_type i) { APPEND_mylist(5); } Internally, the ``kernel_template`` is expanded (at least) twice. Once, for a 'counting' stage where the size of all the lists is determined, and a second time, for a 'generation' stage where the lists are actually filled. A ``generate`` function that has side effects beyond calling ``append`` is therefore ill-formed. .. versionchanged:: 2018.1 Change *eliminate_empty_output_lists* argument type from ``bool`` to ``list``. """ if devices is None: devices = context.devices if count_sharing is None: count_sharing = {} self.context = context self.devices = devices self.list_names_and_dtypes = list_names_and_dtypes self.generate_template = generate_template from pyopencl.tools import parse_arg_list self.arg_decls = parse_arg_list(arg_decls) # To match with the signature of the user-supplied generate(), arguments # can't appear to have offsets. arg_decls_no_offset = [] from pyopencl.tools import VectorArg for arg in self.arg_decls: if isinstance(arg, VectorArg) and arg.with_offset: arg = VectorArg(arg.dtype, arg.name) arg_decls_no_offset.append(arg) self.arg_decls_no_offset = arg_decls_no_offset self.count_sharing = count_sharing self.name_prefix = name_prefix self.preamble = preamble self.options = options self.debug = debug self.complex_kernel = complex_kernel if eliminate_empty_output_lists is True: eliminate_empty_output_lists = \ [name for name, _ in self.list_names_and_dtypes] if eliminate_empty_output_lists is False: eliminate_empty_output_lists = [] self.eliminate_empty_output_lists = eliminate_empty_output_lists for list_name in self.eliminate_empty_output_lists: if not any(list_name == name for name, _ in self.list_names_and_dtypes): raise ValueError( "invalid list name '%s' in eliminate_empty_output_lists" % list_name) # {{{ kernel generators @memoize_method def get_scan_kernel(self, index_dtype): return GenericScanKernel( self.context, index_dtype, arguments="__global %s *ary" % dtype_to_ctype(index_dtype), input_expr="ary[i]", scan_expr="a+b", neutral="0", output_statement="ary[i+1] = item;", devices=self.devices) @memoize_method def get_compress_kernel(self, index_dtype): arguments = """ __global ${index_t} *count, __global ${index_t} *compressed_counts, __global ${index_t} *nonempty_indices, __global ${index_t} *compressed_indices, __global ${index_t} *num_non_empty_list """ arguments = Template(arguments) return GenericScanKernel( self.context, index_dtype, arguments=arguments.render(index_t=dtype_to_ctype(index_dtype)), input_expr="count[i] == 0 ? 
0 : 1", scan_expr="a+b", neutral="0", output_statement=""" if (i + 1 < N) compressed_indices[i + 1] = item; if (prev_item != item) { nonempty_indices[item - 1] = i; compressed_counts[item - 1] = count[i]; } if (i + 1 == N) *num_non_empty_list = item; """, devices=self.devices) def do_not_vectorize(self): return (self.complex_kernel and any(dev.type & cl.device_type.CPU for dev in self.context.devices)) @memoize_method def get_count_kernel(self, index_dtype): index_ctype = dtype_to_ctype(index_dtype) from pyopencl.tools import OtherArg, VectorArg kernel_list_args = [ VectorArg(index_dtype, "plb_%s_count" % name) for name, dtype in self.list_names_and_dtypes if name not in self.count_sharing] user_list_args = [] for name, _dtype in self.list_names_and_dtypes: if name in self.count_sharing: continue name = "plb_loc_%s_count" % name user_list_args.append(OtherArg("{} *{}".format( index_ctype, name), name)) kernel_name = self.name_prefix+"_count" from pyopencl.characterize import has_double_support src = _LIST_BUILDER_TEMPLATE.render( is_count_stage=True, kernel_name=kernel_name, double_support=all(has_double_support(dev) for dev in self.context.devices), debug=self.debug, do_not_vectorize=self.do_not_vectorize(), eliminate_empty_output_lists=self.eliminate_empty_output_lists, kernel_list_arg_decl=_get_arg_decl(kernel_list_args), kernel_list_arg_values=_get_arg_list(user_list_args, prefix="&"), user_list_arg_decl=_get_arg_decl(user_list_args), user_list_args=_get_arg_list(user_list_args), user_arg_decl_with_offset=_get_arg_decl(self.arg_decls), user_arg_decl_no_offset=_get_arg_decl(self.arg_decls_no_offset), user_args_no_offset=_get_arg_list(self.arg_decls_no_offset), arg_offset_adjustment=get_arg_offset_adjuster_code(self.arg_decls), list_names_and_dtypes=self.list_names_and_dtypes, count_sharing=self.count_sharing, name_prefix=self.name_prefix, generate_template=self.generate_template, preamble=self.preamble, index_type=index_ctype, ) src = str(src) prg = cl.Program(self.context, src).build(self.options) knl = getattr(prg, kernel_name) from pyopencl.tools import get_arg_list_scalar_arg_dtypes knl.set_scalar_arg_dtypes([ *get_arg_list_scalar_arg_dtypes([*kernel_list_args, *self.arg_decls]), index_dtype ]) return knl @memoize_method def get_write_kernel(self, index_dtype): index_ctype = dtype_to_ctype(index_dtype) from pyopencl.tools import OtherArg, VectorArg kernel_list_args = [] kernel_list_arg_values = "" user_list_args = [] for name, dtype in self.list_names_and_dtypes: list_name = "plb_%s_list" % name list_arg = VectorArg(dtype, list_name) kernel_list_args.append(list_arg) user_list_args.append(list_arg) if name in self.count_sharing: kernel_list_arg_values += "%s, " % list_name continue kernel_list_args.append( VectorArg(index_dtype, "plb_%s_start_index" % name)) if name in self.eliminate_empty_output_lists: kernel_list_args.append( VectorArg(index_dtype, "%s_compressed_indices" % name)) index_name = "plb_%s_index" % name user_list_args.append(OtherArg("{} *{}".format( index_ctype, index_name), index_name)) kernel_list_arg_values += f"{list_name}, &{index_name}, " kernel_name = self.name_prefix+"_write" from pyopencl.characterize import has_double_support src = _LIST_BUILDER_TEMPLATE.render( is_count_stage=False, kernel_name=kernel_name, double_support=all(has_double_support(dev) for dev in self.context.devices), debug=self.debug, do_not_vectorize=self.do_not_vectorize(), eliminate_empty_output_lists=self.eliminate_empty_output_lists, kernel_list_arg_decl=_get_arg_decl(kernel_list_args), 
kernel_list_arg_values=kernel_list_arg_values, user_list_arg_decl=_get_arg_decl(user_list_args), user_list_args=_get_arg_list(user_list_args), user_arg_decl_with_offset=_get_arg_decl(self.arg_decls), user_arg_decl_no_offset=_get_arg_decl(self.arg_decls_no_offset), user_args_no_offset=_get_arg_list(self.arg_decls_no_offset), arg_offset_adjustment=get_arg_offset_adjuster_code(self.arg_decls), list_names_and_dtypes=self.list_names_and_dtypes, count_sharing=self.count_sharing, name_prefix=self.name_prefix, generate_template=self.generate_template, preamble=self.preamble, index_type=index_ctype, ) src = str(src) prg = cl.Program(self.context, src).build(self.options) knl = getattr(prg, kernel_name) from pyopencl.tools import get_arg_list_scalar_arg_dtypes knl.set_scalar_arg_dtypes([ *get_arg_list_scalar_arg_dtypes(kernel_list_args + self.arg_decls), index_dtype]) return knl # }}} # {{{ driver def __call__(self, queue, n_objects, *args, **kwargs): """ :arg args: arguments corresponding to ``arg_decls`` in the constructor. Array-like arguments must be either 1D :class:`pyopencl.array.Array` objects or :class:`pyopencl.MemoryObject` objects, of which the latter can be obtained from a :class:`pyopencl.array.Array` using the :attr:`pyopencl.array.Array.data` attribute. :arg allocator: optionally, the allocator to use to allocate new arrays. :arg omit_lists: an iterable of list names that should *not* be built with this invocation. The kernel code may *not* call ``APPEND_name`` for these omitted lists. If it does, undefined behavior will result. The returned *lists* dictionary will not contain an entry for names in *omit_lists*. :arg wait_for: |explain-waitfor| :returns: a tuple ``(lists, event)``, where ``lists`` is a mapping from (built) list names to objects which have attributes * ``count`` for the total number of entries in all lists combined * ``lists`` for the array containing all lists. * ``starts`` for the array of starting indices in ``lists``. ``starts`` is built so that it has n+1 entries, so that the *i*'th entry is the start of the *i*'th list, and the *(i+1)*'th entry is the index one past the *i*'th list's end, even for the last list. This implies that all lists are contiguous. If the list name is specified in *eliminate_empty_output_lists* constructor argument, *lists* has two additional attributes ``num_nonempty_lists`` and ``nonempty_indices`` * ``num_nonempty_lists`` for the number of nonempty lists. * ``nonempty_indices`` for the index of nonempty list in input objects. In this case, ``starts`` has ``num_nonempty_lists + 1`` entries. The *i*'th entry is the start of the *i*'th nonempty list, which is generated by the object with index ``nonempty_indices[i]``. *event* is a :class:`pyopencl.Event` for dependency management. .. versionchanged:: 2016.2 Added omit_lists. """ if n_objects >= int(np.iinfo(np.int32).max): index_dtype = np.int64 else: index_dtype = np.int32 index_dtype = np.dtype(index_dtype) allocator = kwargs.pop("allocator", None) omit_lists = kwargs.pop("omit_lists", []) wait_for = kwargs.pop("wait_for", None) if kwargs: raise TypeError("invalid keyword arguments: '%s'" % ", ".join(kwargs)) for oml in omit_lists: if not any(oml == name for name, _ in self.list_names_and_dtypes): raise ValueError("invalid list name '%s' in omit_lists" % oml) result = {} count_list_args = [] if wait_for is None: wait_for = [] else: # We'll be modifying it below.
wait_for = list(wait_for) count_kernel = self.get_count_kernel(index_dtype) write_kernel = self.get_write_kernel(index_dtype) scan_kernel = self.get_scan_kernel(index_dtype) if self.eliminate_empty_output_lists: compress_kernel = self.get_compress_kernel(index_dtype) data_args = [] for i, (arg_descr, arg_val) in enumerate(zip(self.arg_decls, args)): from pyopencl.tools import VectorArg if isinstance(arg_descr, VectorArg): from pyopencl import MemoryObject if arg_val is None: data_args.append(arg_val) if arg_descr.with_offset: data_args.append(0) continue if isinstance(arg_val, MemoryObject): data_args.append(arg_val) if arg_descr.with_offset: raise ValueError( "with_offset=True specified for argument %d " "but the argument is not an array" % i) continue if arg_val.ndim != 1: raise ValueError("argument %d is a multidimensional array" % i) data_args.append(arg_val.base_data) if arg_descr.with_offset: data_args.append(arg_val.offset) wait_for.extend(arg_val.events) else: data_args.append(arg_val) del args data_args = tuple(data_args) # {{{ allocate memory for counts for name, _dtype in self.list_names_and_dtypes: if name in self.count_sharing: continue if name in omit_lists: count_list_args.append(None) continue counts = cl.array.empty(queue, (n_objects + 1), index_dtype, allocator=allocator) counts[-1] = 0 wait_for = wait_for + counts.events # The scan will turn the "counts" array into the "starts" array # in-place. if name in self.eliminate_empty_output_lists: result[name] = BuiltList(count=None, starts=counts, lists=None, num_nonempty_lists=None, nonempty_indices=None) else: result[name] = BuiltList(count=None, starts=counts, lists=None) count_list_args.append(counts.data) # }}} if self.debug: gsize = (1,) lsize = (1,) elif self.do_not_vectorize(): gsize = (4*queue.device.max_compute_units,) lsize = (1,) else: from pyopencl.array import _splay gsize, lsize = _splay(queue.device, n_objects) count_event = count_kernel(queue, gsize, lsize, *(tuple(count_list_args) + data_args + (n_objects,)), wait_for=wait_for) compress_events = {} for name, _dtype in self.list_names_and_dtypes: if name in omit_lists: continue if name in self.count_sharing: continue if name not in self.eliminate_empty_output_lists: continue compressed_counts = cl.array.empty( queue, (n_objects + 1,), index_dtype, allocator=allocator) info_record = result[name] info_record.nonempty_indices = cl.array.empty( queue, (n_objects + 1,), index_dtype, allocator=allocator) info_record.num_nonempty_lists = cl.array.empty( queue, (1,), index_dtype, allocator=allocator) info_record.compressed_indices = cl.array.empty( queue, (n_objects + 1,), index_dtype, allocator=allocator) info_record.compressed_indices[0] = 0 compress_events[name] = compress_kernel( # pylint: disable=possibly-used-before-assignment info_record.starts, compressed_counts, info_record.nonempty_indices, info_record.compressed_indices, info_record.num_nonempty_lists, wait_for=[count_event, *info_record.compressed_indices.events]) info_record.starts = compressed_counts # {{{ run scans scan_events = [] for name, _dtype in self.list_names_and_dtypes: if name in self.count_sharing: continue if name in omit_lists: continue info_record = result[name] if name in self.eliminate_empty_output_lists: compress_events[name].wait() num_nonempty_lists = info_record.num_nonempty_lists.get()[0] info_record.num_nonempty_lists = num_nonempty_lists info_record.starts = info_record.starts[:num_nonempty_lists + 1] info_record.nonempty_indices = \ 
info_record.nonempty_indices[:num_nonempty_lists] info_record.starts[-1] = 0 starts_ary = info_record.starts if name in self.eliminate_empty_output_lists: evt = scan_kernel( starts_ary, size=info_record.num_nonempty_lists, wait_for=starts_ary.events) else: evt = scan_kernel(starts_ary, wait_for=[count_event], size=n_objects) starts_ary.setitem(0, 0, queue=queue, wait_for=[evt]) scan_events.extend(starts_ary.events) # retrieve count info_record.count = int(starts_ary[-1].get()) # }}} # {{{ deal with count-sharing lists, allocate memory for lists write_list_args = [] for name, dtype in self.list_names_and_dtypes: if name in omit_lists: write_list_args.append(None) if name not in self.count_sharing: write_list_args.append(None) if name in self.eliminate_empty_output_lists: write_list_args.append(None) continue if name in self.count_sharing: sharing_from = self.count_sharing[name] info_record = result[name] = BuiltList( count=result[sharing_from].count, starts=result[sharing_from].starts, ) else: info_record = result[name] info_record.lists = cl.array.empty(queue, info_record.count, dtype, allocator=allocator) write_list_args.append(info_record.lists.data) if name not in self.count_sharing: write_list_args.append(info_record.starts.data) if name in self.eliminate_empty_output_lists: write_list_args.append(info_record.compressed_indices.data) # }}} evt = write_kernel(queue, gsize, lsize, *(tuple(write_list_args) + data_args + (n_objects,)), wait_for=scan_events) return result, evt # }}} # }}} # {{{ key-value sorting @dataclass(frozen=True) class _KernelInfo: by_target_sorter: RadixSort start_finder: ElementwiseKernel bound_propagation_scan: GenericScanKernel def _make_cl_int_literal(value, dtype): iinfo = np.iinfo(dtype) result = str(int(value)) if dtype.itemsize == 8: result += "l" if int(iinfo.min) == 0: result += "u" return result class KeyValueSorter: """Given arrays *values* and *keys* of equal length and a number *nkeys* of keys, returns a tuple `(starts, lists, event)`, as follows: *values* and *keys* are sorted by *keys*, and the sorted *values* is returned as *lists*. Then for each index *i* in ``range(nkeys)``, *starts[i]* is written, indicating where the group of *values* belonging to the key with index *i* begins. It implicitly ends at *starts[i+1]*. ``starts`` is built so that it has ``nkeys + 1`` entries, so that the *i*'th entry is the start of the *i*'th list, and the *(i+1)*'th entry is the index one past the *i*'th list's end, even for the last list. This implies that all lists are contiguous. .. note:: This functionality is provided as a preview. Its interface is subject to change until this notice is removed. ..
versionadded:: 2013.1 """ def __init__(self, context): self.context = context @memoize_method def get_kernels(self, key_dtype, value_dtype, starts_dtype): from pyopencl.tools import ScalarArg, VectorArg by_target_sorter = RadixSort( self.context, [ VectorArg(value_dtype, "values"), VectorArg(key_dtype, "keys"), ], key_expr="keys[i]", sort_arg_names=["values", "keys"]) from pyopencl.elementwise import ElementwiseTemplate start_finder = ElementwiseTemplate( arguments="""//CL// starts_t *key_group_starts, key_t *keys_sorted_by_key, """, operation=r"""//CL// key_t my_key = keys_sorted_by_key[i]; if (i == 0 || my_key != keys_sorted_by_key[i-1]) key_group_starts[my_key] = i; """, name="find_starts").build(self.context, type_aliases=( ("key_t", starts_dtype), ("starts_t", starts_dtype), ), var_values=()) bound_propagation_scan = GenericScanKernel( self.context, starts_dtype, arguments=[ VectorArg(starts_dtype, "starts"), # starts has length n+1 ScalarArg(key_dtype, "nkeys"), ], input_expr="starts[nkeys-i]", scan_expr="min(a, b)", neutral=_make_cl_int_literal( np.iinfo(starts_dtype).max, starts_dtype), output_statement="starts[nkeys-i] = item;") return _KernelInfo( by_target_sorter=by_target_sorter, start_finder=start_finder, bound_propagation_scan=bound_propagation_scan) def __call__(self, queue, keys, values, nkeys, starts_dtype, allocator=None, wait_for=None): if allocator is None: allocator = values.allocator knl_info = self.get_kernels(keys.dtype, values.dtype, starts_dtype) (values_sorted_by_key, keys_sorted_by_key), evt = knl_info.by_target_sorter( values, keys, queue=queue, wait_for=wait_for) starts = (cl.array.empty(queue, (nkeys+1), starts_dtype, allocator=allocator) .fill(len(values_sorted_by_key), wait_for=[evt])) evt, = starts.events evt = knl_info.start_finder(starts, keys_sorted_by_key, range=slice(len(keys_sorted_by_key)), wait_for=[evt]) evt = knl_info.bound_propagation_scan(starts, nkeys, queue=queue, wait_for=[evt]) return starts, values_sorted_by_key, evt # }}} # vim: filetype=pyopencl:fdm=marker pyopencl-2025.1/pyopencl/array.py0000644000000000000000000032605414332717401013717 0ustar00"""CL device arrays.""" # NOTE: for elwise_kernel_runner which adds keyword arguments # pylint:disable=unexpected-keyword-arg __copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import builtins from dataclasses import dataclass from functools import reduce from numbers import Number from typing import Any, Dict, List, Optional, Tuple, Union from warnings import warn import numpy as np import pyopencl as cl import pyopencl.elementwise as elementwise from pyopencl import cltypes from pyopencl.characterize import has_double_support from pyopencl.compyte.array import ( ArrayFlags as _ArrayFlags, as_strided as _as_strided, c_contiguous_strides as _c_contiguous_strides, equal_strides as _equal_strides, f_contiguous_strides as _f_contiguous_strides, ) SCALAR_CLASSES = (Number, np.bool_, bool) if cl.get_cl_header_version() >= (2, 0): _SVMPointer_or_nothing = cl.SVMPointer else: _SVMPointer_or_nothing = () # {{{ _get_common_dtype class DoubleDowncastWarning(UserWarning): pass _DOUBLE_DOWNCAST_WARNING = ( "The operation you requested would result in a double-precision " "quantity according to numpy semantics. Since your device does not " "support double precision, a single-precision quantity is being returned.") def _get_common_dtype(obj1, obj2, queue): if queue is None: raise ValueError("PyOpenCL array has no queue; call .with_queue() to " "add one in order to be able to perform operations") # Note: We are calling np.result_type with pyopencl arrays here. # Luckily, np.result_type only looks at the dtype of input arrays up until # at least numpy v2.1. result = np.result_type(obj1, obj2) if not has_double_support(queue.device): if result == np.float64: result = np.dtype(np.float32) warn(_DOUBLE_DOWNCAST_WARNING, DoubleDowncastWarning, stacklevel=3) elif result == np.complex128: result = np.dtype(np.complex64) warn(_DOUBLE_DOWNCAST_WARNING, DoubleDowncastWarning, stacklevel=3) return result # }}} # {{{ _get_truedivide_dtype def _get_truedivide_dtype(obj1, obj2, queue): # the dtype of the division result obj1 / obj2 allow_double = has_double_support(queue.device) x1 = obj1 if np.isscalar(obj1) else np.ones(1, obj1.dtype) x2 = obj2 if np.isscalar(obj2) else np.ones(1, obj2.dtype) result = (x1/x2).dtype if not allow_double: if result == np.float64: result = np.dtype(np.float32) elif result == np.complex128: result = np.dtype(np.complex64) return result # }}} # {{{ _get_broadcasted_binary_op_result def _get_broadcasted_binary_op_result(obj1, obj2, cq, dtype_getter=_get_common_dtype): if obj1.shape == obj2.shape: return obj1._new_like_me(dtype_getter(obj1, obj2, cq), cq) elif obj1.shape == (): return obj2._new_like_me(dtype_getter(obj1, obj2, cq), cq) elif obj2.shape == (): return obj1._new_like_me(dtype_getter(obj1, obj2, cq), cq) else: raise NotImplementedError("Broadcasting binary operator with shapes:" f" {obj1.shape}, {obj2.shape}.") # }}} # {{{ VecLookupWarner class VecLookupWarner: def __getattr__(self, name): warn("pyopencl.array.vec is deprecated. 
" "Please use pyopencl.cltypes for OpenCL vector and scalar types", DeprecationWarning, stacklevel=2) if name == "types": name = "vec_types" elif name == "type_to_scalar_and_count": name = "vec_type_to_scalar_and_count" return getattr(cltypes, name) vec = VecLookupWarner() # }}} # {{{ helper functionality def _splay(device, n, kernel_specific_max_wg_size=None): max_work_items = builtins.min(128, device.max_work_group_size) if kernel_specific_max_wg_size is not None: max_work_items = builtins.min(max_work_items, kernel_specific_max_wg_size) min_work_items = builtins.min(32, max_work_items) max_groups = device.max_compute_units * 4 * 8 # 4 to overfill the device # 8 is an Nvidia constant--that's how many # groups fit onto one compute device if n < min_work_items: group_count = 1 work_items_per_group = min_work_items elif n < (max_groups * min_work_items): group_count = (n + min_work_items - 1) // min_work_items work_items_per_group = min_work_items elif n < (max_groups * max_work_items): group_count = max_groups grp = (n + min_work_items - 1) // min_work_items work_items_per_group = ( (grp + max_groups - 1) // max_groups) * min_work_items else: group_count = max_groups work_items_per_group = max_work_items # print("n:%d gc:%d wipg:%d" % (n, group_count, work_items_per_group)) return (group_count*work_items_per_group,), (work_items_per_group,) # deliberately undocumented for now ARRAY_KERNEL_EXEC_HOOK = None def elwise_kernel_runner(kernel_getter): """Take a kernel getter of the same signature as the kernel and return a function that invokes that kernel. Assumes that the zeroth entry in *args* is an :class:`Array`. """ from functools import wraps @wraps(kernel_getter) def kernel_runner(out, *args, **kwargs): assert isinstance(out, Array) wait_for = kwargs.pop("wait_for", None) queue = kwargs.pop("queue", None) if queue is None: queue = out.queue assert queue is not None knl = kernel_getter(out, *args, **kwargs) work_group_info = knl.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, queue.device) gs, ls = out._get_sizes(queue, work_group_info) args = (out, *args, out.size) if ARRAY_KERNEL_EXEC_HOOK is not None: return ARRAY_KERNEL_EXEC_HOOK( # pylint: disable=not-callable knl, queue, gs, ls, *args, wait_for=wait_for) else: return knl(queue, gs, ls, *args, wait_for=wait_for) return kernel_runner class DefaultAllocator(cl.tools.DeferredAllocator): def __init__(self, *args, **kwargs): warn("pyopencl.array.DefaultAllocator is deprecated. " "It will be continue to exist throughout the 2013.x " "versions of PyOpenCL.", DeprecationWarning, stacklevel=2) cl.tools.DeferredAllocator.__init__(self, *args, **kwargs) # }}} # {{{ array class class InconsistentOpenCLQueueWarning(UserWarning): pass class ArrayHasOffsetError(ValueError): """ .. versionadded:: 2013.1 """ def __init__(self, val="The operation you are attempting does not yet " "support arrays that start at an offset from the beginning " "of their buffer."): ValueError.__init__(self, val) class _copy_queue: # noqa: N801 pass _ARRAY_GET_SIZES_CACHE: Dict[Tuple[int, int, int], Tuple[int, int]] = {} _BOOL_DTYPE = np.dtype(np.int8) _NOT_PRESENT = object() class Array: """A :class:`numpy.ndarray` work-alike that stores its data and performs its computations on the compute device. :attr:`shape` and :attr:`dtype` work exactly as in :mod:`numpy`. Arithmetic methods in :class:`Array` support the broadcasting of scalars. (e.g. ``array + 5``). *cq* must be a :class:`~pyopencl.CommandQueue` or a :class:`~pyopencl.Context`. 
If it is a queue, *cq* specifies the queue in which the array carries out its computations by default. If a default queue (and thereby overloaded operators and many other niceties) are not desired, pass a :class:`~pyopencl.Context`. *allocator* may be *None* or a callable that, upon being called with an argument of the number of bytes to be allocated, returns a :class:`pyopencl.Buffer` object. (A :class:`pyopencl.tools.MemoryPool` instance is one useful example of an object to pass here.) .. versionchanged:: 2011.1 Renamed *context* to *cqa*, made it general-purpose. All arguments beyond *order* should be considered keyword-only. .. versionchanged:: 2015.2 Renamed *context* to *cq*, disallowed passing allocators through it. .. attribute :: data The :class:`pyopencl.MemoryObject` instance created for the memory that backs this :class:`Array`. .. versionchanged:: 2013.1 If a non-zero :attr:`offset` has been specified for this array, this will fail with :exc:`ArrayHasOffsetError`. .. attribute :: base_data The :class:`pyopencl.MemoryObject` instance created for the memory that backs this :class:`Array`. Unlike :attr:`data`, the base address of *base_data* is allowed to be different from the beginning of the array. The actual beginning is the base address of *base_data* plus :attr:`offset` bytes. Unlike :attr:`data`, retrieving :attr:`base_data` always succeeds. .. versionadded:: 2013.1 .. attribute :: offset See :attr:`base_data`. .. versionadded:: 2013.1 .. attribute :: shape A tuple of lengths of each dimension in the array. .. attribute :: ndim The number of dimensions in :attr:`shape`. .. attribute :: dtype The :class:`numpy.dtype` of the items in the GPU array. .. attribute :: size The number of meaningful entries in the array. Can also be computed by multiplying up the numbers in :attr:`shape`. .. attribute :: nbytes The size of the entire array in bytes. Computed as :attr:`size` times ``dtype.itemsize``. .. attribute :: strides A tuple of bytes to step in each dimension when traversing an array. .. attribute :: flags An object with attributes ``c_contiguous``, ``f_contiguous`` and ``forc``, which may be used to query contiguity properties in analogy to :attr:`numpy.ndarray.flags`. .. rubric:: Methods .. automethod :: with_queue .. automethod :: __len__ .. automethod :: reshape .. automethod :: ravel .. automethod :: view .. automethod :: squeeze .. automethod :: transpose .. attribute :: T .. automethod :: set .. automethod :: get .. automethod :: get_async .. automethod :: copy .. automethod :: __str__ .. automethod :: __repr__ .. automethod :: mul_add .. automethod :: __add__ .. automethod :: __sub__ .. automethod :: __iadd__ .. automethod :: __isub__ .. automethod :: __pos__ .. automethod :: __neg__ .. automethod :: __mul__ .. automethod :: __div__ .. automethod :: __rdiv__ .. automethod :: __pow__ .. automethod :: __and__ .. automethod :: __xor__ .. automethod :: __or__ .. automethod :: __iand__ .. automethod :: __ixor__ .. automethod :: __ior__ .. automethod :: __abs__ .. automethod :: __invert__ .. UNDOC reverse() .. automethod :: fill .. automethod :: astype .. autoattribute :: real .. autoattribute :: imag .. automethod :: conj .. automethod :: conjugate .. automethod :: __getitem__ .. automethod :: __setitem__ .. automethod :: setitem .. automethod :: map_to_host .. rubric:: Comparisons, conditionals, any, all .. versionadded:: 2013.2 Boolean arrays are stored as :class:`numpy.int8` because ``bool`` has an unspecified size in the OpenCL spec. .. 
automethod :: __bool__ Only works for device scalars. (i.e. "arrays" with ``shape == ()``) .. automethod :: any .. automethod :: all .. automethod :: __eq__ .. automethod :: __ne__ .. automethod :: __lt__ .. automethod :: __le__ .. automethod :: __gt__ .. automethod :: __ge__ .. rubric:: Event management If an array is used from within an out-of-order queue, it needs to take care of its own operation ordering. The facilities in this section make this possible. .. versionadded:: 2014.1.1 .. attribute:: events A list of :class:`pyopencl.Event` instances that the current content of this array depends on. User code may read, but should never modify this list directly. To update this list, instead use the following methods. .. automethod:: add_event .. automethod:: finish """ __array_priority__ = 100 def __init__( self, cq: Optional[Union[cl.Context, cl.CommandQueue]], shape: Union[Tuple[int, ...], int], dtype: Any, order: str = "C", allocator: Optional[cl.tools.AllocatorBase] = None, data: Any = None, offset: int = 0, strides: Optional[Tuple[int, ...]] = None, events: Optional[List[cl.Event]] = None, # NOTE: following args are used for the fast constructor _flags: Any = None, _fast: bool = False, _size: Optional[int] = None, _context: Optional[cl.Context] = None, _queue: Optional[cl.CommandQueue] = None) -> None: if _fast: # Assumptions, should be disabled if not testing if 0: assert cq is None assert isinstance(_context, cl.Context) assert _queue is None or isinstance(_queue, cl.CommandQueue) assert isinstance(shape, tuple) assert isinstance(strides, tuple) assert isinstance(dtype, np.dtype) assert _size is not None size = _size context = _context queue = _queue alloc_nbytes = dtype.itemsize * size else: # {{{ backward compatibility if cq is None: context = _context queue = _queue elif isinstance(cq, cl.CommandQueue): queue = cq context = queue.context elif isinstance(cq, cl.Context): context = cq queue = None else: raise TypeError( f"cq may be a queue or a context, not '{type(cq).__name__}'") if allocator is not None: # "is" would be wrong because two Python objects are allowed # to hold handles to the same context. # FIXME It would be nice to check this. But it would require # changing the allocator interface. Trust the user for now. # assert allocator.context == context pass # Queue-less arrays do have a purpose in life. # They don't do very much, but at least they don't run kernels # in random queues. # # See also :meth:`with_queue`. del cq # }}} # invariant here: allocator, queue set # {{{ determine shape, size, and strides dtype = np.dtype(dtype) try: shape = tuple(shape) # type: ignore[arg-type] except TypeError as err: if not isinstance(shape, (int, np.integer)): raise TypeError( "shape must either be iterable or castable to an integer: " f"got a '{type(shape).__name__}'") from err shape = (shape,) shape_array = np.array(shape) # Previously, the size was computed as # "size = 1; size *= dim for dim in shape" # However this can fail when using certain data types, # eg numpy.uint64(1) * 2 returns 2.0 ! 
if np.any(shape_array < 0): raise ValueError(f"negative dimensions are not allowed: {shape}") if np.any([np.array([s]).dtype.kind not in ["u", "i"] for s in shape]): raise ValueError( f"invalid shape {shape}: dimensions must be integers") size = np.prod(shape_array, dtype=np.uint64).item() if strides is None: if order in "cC": # inlined from compyte.array.c_contiguous_strides if shape: strides_tmp = [dtype.itemsize] for s in shape[:0:-1]: # NOTE: https://github.com/inducer/compyte/pull/36 strides_tmp.append(strides_tmp[-1]*builtins.max(1, s)) strides = tuple(strides_tmp[::-1]) else: strides = () elif order in "fF": strides = _f_contiguous_strides(dtype.itemsize, shape) else: raise ValueError(f"invalid order: {order}") else: # FIXME: We should possibly perform some plausibility # checking on 'strides' here. strides = tuple(strides) # }}} assert dtype != object, \ "object arrays on the compute device are not allowed" # noqa: E721 assert isinstance(shape, tuple) assert isinstance(strides, tuple) alloc_nbytes = dtype.itemsize * size if alloc_nbytes < 0: raise ValueError("cannot allocate CL buffer with negative size") self.queue = queue self.shape = shape self.dtype = dtype self.strides = strides self.events = [] if events is None else events self.nbytes = alloc_nbytes self.size = size self.allocator = allocator if data is None: if alloc_nbytes == 0: self.base_data = None else: if self.allocator is None: if context is None and queue is not None: context = queue.context self.base_data = cl.Buffer( context, cl.mem_flags.READ_WRITE, alloc_nbytes) else: self.base_data = self.allocator(alloc_nbytes) else: self.base_data = data self.offset = offset self.context = context self._flags = _flags if __debug__: if queue is not None and isinstance( self.base_data, _SVMPointer_or_nothing): mem_queue = getattr(self.base_data, "_queue", _NOT_PRESENT) if mem_queue is not _NOT_PRESENT and mem_queue != queue: warn("Array has different queue from backing SVM memory. " "This may lead to the array getting deallocated sooner " "than expected, potentially leading to crashes.", InconsistentOpenCLQueueWarning, stacklevel=2) @property def ndim(self): return len(self.shape) @property def data(self): if self.offset: raise ArrayHasOffsetError() else: return self.base_data @property def flags(self): f = self._flags if f is None: self._flags = f = _ArrayFlags(self) return f def _new_with_changes(self, data, offset, shape=None, dtype=None, strides=None, queue=_copy_queue, allocator=None): """ :arg data: *None* means allocate a new array. """ fast = True size = self.size if shape is None: shape = self.shape else: fast = False size = None if dtype is None: dtype = self.dtype if strides is None: strides = self.strides if queue is _copy_queue: queue = self.queue if allocator is None: allocator = self.allocator # If we're allocating new data, then there's not likely to be # a data dependency. Otherwise, the two arrays should probably # share the same events list. if data is None: events = None else: events = self.events return self.__class__(None, shape, dtype, allocator=allocator, strides=strides, data=data, offset=offset, events=events, _fast=fast, _context=self.context, _queue=queue, _size=size) def with_queue(self, queue): """Return a copy of *self* with the default queue set to *queue*. *None* is allowed as a value for *queue*. .. 
versionadded:: 2013.1 """ if queue is not None: assert queue.context == self.context return self._new_with_changes(self.base_data, self.offset, queue=queue) def _get_sizes(self, queue, kernel_specific_max_wg_size=None): if not self.flags.forc: raise NotImplementedError("cannot operate on non-contiguous array") cache_key = (queue.device.int_ptr, self.size, kernel_specific_max_wg_size) try: return _ARRAY_GET_SIZES_CACHE[cache_key] except KeyError: sizes = _splay(queue.device, self.size, kernel_specific_max_wg_size=kernel_specific_max_wg_size) _ARRAY_GET_SIZES_CACHE[cache_key] = sizes return sizes def set(self, ary, queue=None, async_=None, **kwargs): """Transfer the contents of the :class:`numpy.ndarray` object *ary* onto the device. *ary* must have the same dtype and size (not necessarily shape) as *self*. *async_* is a Boolean indicating whether the function is allowed to return before the transfer completes. To avoid synchronization bugs, this defaults to *False*. .. versionchanged:: 2017.2.1 Python 3.7 makes ``async`` a reserved keyword. On older Pythons, we will continue to accept *async* as a parameter, however this should be considered deprecated. *async_* is the new, official spelling. """ # {{{ handle 'async' deprecation async_arg = kwargs.pop("async", None) if async_arg is not None: if async_ is not None: raise TypeError("may not specify both 'async' and 'async_'") async_ = async_arg if async_ is None: async_ = False if kwargs: raise TypeError("extra keyword arguments specified: %s" % ", ".join(kwargs)) # }}} assert ary.size == self.size assert ary.dtype == self.dtype if not ary.flags.forc: raise RuntimeError("cannot set from non-contiguous array") if not _equal_strides(ary.strides, self.strides, self.shape): warn("Setting array from one with different " "strides/storage order. This will cease to work " "in a future release.", stacklevel=2) if self.size: event1 = cl.enqueue_copy(queue or self.queue, self.base_data, ary, dst_offset=self.offset, is_blocking=not async_) self.add_event(event1) def _get(self, queue=None, ary=None, async_=None, **kwargs): # {{{ handle 'async' deprecation async_arg = kwargs.pop("async", None) if async_arg is not None: if async_ is not None: raise TypeError("may not specify both 'async' and 'async_'") async_ = async_arg if async_ is None: async_ = False if kwargs: raise TypeError("extra keyword arguments specified: %s" % ", ".join(kwargs)) # }}} if ary is None: ary = np.empty(self.shape, self.dtype) if self.strides != ary.strides: ary = _as_strided(ary, strides=self.strides) else: if ary.size != self.size: raise TypeError("'ary' has non-matching size") if ary.dtype != self.dtype: raise TypeError("'ary' has non-matching type") if self.shape != ary.shape: warn("get() between arrays of different shape is deprecated " "and will be removed in a future release", DeprecationWarning, stacklevel=2) assert self.flags.forc, "Array in get() must be contiguous" queue = queue or self.queue if queue is None: raise ValueError("Cannot copy array to host. " "Array has no queue. Use " "'new_array = array.with_queue(queue)' " "to associate one.") if self.size: event1 = cl.enqueue_copy(queue, ary, self.base_data, src_offset=self.offset, wait_for=self.events, is_blocking=not async_) self.add_event(event1) else: event1 = None return ary, event1 def get(self, queue=None, ary=None, async_=None, **kwargs): """Transfer the contents of *self* into *ary* or a newly allocated :class:`numpy.ndarray`. If *ary* is given, it must have the same shape and dtype. .. 
versionchanged:: 2019.1.2 Calling with ``async_=True`` was deprecated and replaced by :meth:`get_async`. The event returned by :meth:`pyopencl.enqueue_copy` is now stored into :attr:`events` to ensure data is not modified before the copy is complete. .. versionchanged:: 2015.2 *ary* with different shape was deprecated. .. versionchanged:: 2017.2.1 Python 3.7 makes ``async`` a reserved keyword. On older Pythons, we will continue to accept *async* as a parameter, however this should be considered deprecated. *async_* is the new, official spelling. """ if async_: warn("calling pyopencl.Array.get with 'async_=True' is deprecated. " "Please use pyopencl.Array.get_async for asynchronous " "device-to-host transfers", DeprecationWarning, stacklevel=2) ary, _event1 = self._get(queue=queue, ary=ary, async_=async_, **kwargs) return ary def get_async(self, queue=None, ary=None, **kwargs): """ Asynchronous version of :meth:`get` which returns a tuple ``(ary, event)`` containing the host array ``ary`` and the :class:`pyopencl.NannyEvent` ``event`` returned by :meth:`pyopencl.enqueue_copy`. .. versionadded:: 2019.1.2 """ return self._get(queue=queue, ary=ary, async_=True, **kwargs) def copy(self, queue=_copy_queue): """ :arg queue: The :class:`~pyopencl.CommandQueue` for the returned array. .. versionchanged:: 2017.1.2 Updates the queue of the returned array. .. versionadded:: 2013.1 """ if queue is _copy_queue: queue = self.queue result = self._new_like_me(queue=queue) # result.queue won't be the same as queue if queue is None. # We force them to be the same here. if result.queue is not queue: result = result.with_queue(queue) if not self.flags.forc: raise RuntimeError("cannot copy non-contiguous array") if self.nbytes: event1 = cl.enqueue_copy(queue or self.queue, result.base_data, self.base_data, src_offset=self.offset, byte_count=self.nbytes, wait_for=self.events) result.add_event(event1) return result def __str__(self): if self.queue is None: return (f"<cl.{type(self).__name__} {self.shape} of {self.dtype} " "without queue, call with_queue()>") return str(self.get()) def __repr__(self): if self.queue is None: return (f"<cl.{type(self).__name__} {self.shape} of {self.dtype} " f"at {id(self):x} without queue, call with_queue()>") result = repr(self.get()) if result[:5] == "array": result = f"cl.{type(self).__name__}" + result[5:] else: warn( f"{type(self).__name__}.__repr__ expected the repr of the " f"host-side array to start with 'array', got '{result[:10]!r}'", stacklevel=2) return result def safely_stringify_for_pudb(self): return f"cl.{type(self).__name__} {self.dtype} {self.shape}" def __hash__(self): raise TypeError("pyopencl arrays are not hashable.") # {{{ kernel invocation wrappers @staticmethod @elwise_kernel_runner def _axpbyz(out, afac, a, bfac, b, queue=None): """Compute ``out = afac * a + bfac * b``, where *a* and *b* are arrays.""" a_shape = a.shape b_shape = b.shape out_shape = out.shape assert (a_shape == b_shape == out_shape or (a_shape == () and b_shape == out_shape) or (b_shape == () and a_shape == out_shape)) return elementwise.get_axpbyz_kernel( out.context, a.dtype, b.dtype, out.dtype, x_is_scalar=(a_shape == ()), y_is_scalar=(b_shape == ())) @staticmethod @elwise_kernel_runner def _axpbz(out, a, x, b, queue=None): """Compute ``z = a * x + b``, where *a* and *b* are scalars.""" a = np.array(a) b = np.array(b) assert out.shape == x.shape return elementwise.get_axpbz_kernel(out.context, a.dtype, x.dtype, b.dtype, out.dtype) @staticmethod @elwise_kernel_runner def _elwise_multiply(out, a, b, queue=None): a_shape = a.shape b_shape = b.shape out_shape = out.shape assert (a_shape == b_shape == out_shape or (a_shape == () and b_shape == out_shape) or (b_shape == () and a_shape == 
out_shape)) return elementwise.get_multiply_kernel( a.context, a.dtype, b.dtype, out.dtype, x_is_scalar=(a_shape == ()), y_is_scalar=(b_shape == ()) ) @staticmethod @elwise_kernel_runner def _rdiv_scalar(out, ary, other, queue=None): other = np.array(other) assert out.shape == ary.shape return elementwise.get_rdivide_elwise_kernel( out.context, ary.dtype, other.dtype, out.dtype) @staticmethod @elwise_kernel_runner def _div(out, self, other, queue=None): """Divides an array by another array.""" assert (self.shape == other.shape == out.shape or (self.shape == () and other.shape == out.shape) or (other.shape == () and self.shape == out.shape)) return elementwise.get_divide_kernel(self.context, self.dtype, other.dtype, out.dtype, x_is_scalar=(self.shape == ()), y_is_scalar=(other.shape == ())) @staticmethod @elwise_kernel_runner def _fill(result, scalar): return elementwise.get_fill_kernel(result.context, result.dtype) @staticmethod @elwise_kernel_runner def _abs(result, arg): if arg.dtype.kind == "c": from pyopencl.elementwise import complex_dtype_to_name fname = "%s_abs" % complex_dtype_to_name(arg.dtype) elif arg.dtype.kind == "f": fname = "fabs" elif arg.dtype.kind in ["u", "i"]: fname = "abs" else: raise TypeError("unsupported dtype in _abs()") return elementwise.get_unary_func_kernel( arg.context, fname, arg.dtype, out_dtype=result.dtype) @staticmethod @elwise_kernel_runner def _real(result, arg): from pyopencl.elementwise import complex_dtype_to_name fname = "%s_real" % complex_dtype_to_name(arg.dtype) return elementwise.get_unary_func_kernel( arg.context, fname, arg.dtype, out_dtype=result.dtype) @staticmethod @elwise_kernel_runner def _imag(result, arg): from pyopencl.elementwise import complex_dtype_to_name fname = "%s_imag" % complex_dtype_to_name(arg.dtype) return elementwise.get_unary_func_kernel( arg.context, fname, arg.dtype, out_dtype=result.dtype) @staticmethod @elwise_kernel_runner def _conj(result, arg): from pyopencl.elementwise import complex_dtype_to_name fname = "%s_conj" % complex_dtype_to_name(arg.dtype) return elementwise.get_unary_func_kernel( arg.context, fname, arg.dtype, out_dtype=result.dtype) @staticmethod @elwise_kernel_runner def _pow_scalar(result, ary, exponent): exponent = np.array(exponent) return elementwise.get_pow_kernel(result.context, ary.dtype, exponent.dtype, result.dtype, is_base_array=True, is_exp_array=False) @staticmethod @elwise_kernel_runner def _rpow_scalar(result, base, exponent): base = np.array(base) return elementwise.get_pow_kernel(result.context, base.dtype, exponent.dtype, result.dtype, is_base_array=False, is_exp_array=True) @staticmethod @elwise_kernel_runner def _pow_array(result, base, exponent): return elementwise.get_pow_kernel( result.context, base.dtype, exponent.dtype, result.dtype, is_base_array=True, is_exp_array=True) @staticmethod @elwise_kernel_runner def _reverse(result, ary): return elementwise.get_reverse_kernel(result.context, ary.dtype) @staticmethod @elwise_kernel_runner def _copy(dest, src): return elementwise.get_copy_kernel( dest.context, dest.dtype, src.dtype) def _new_like_me(self, dtype=None, queue=None): if dtype is None: dtype = self.dtype strides = self.strides flags = self.flags fast = True else: strides = None flags = None if dtype == self.dtype: strides = self.strides flags = self.flags fast = True else: fast = False queue = queue or self.queue return self.__class__(None, self.shape, dtype, allocator=self.allocator, strides=strides, _flags=flags, _fast=fast, _size=self.size, _queue=queue, 
_context=self.context) @staticmethod @elwise_kernel_runner def _scalar_binop(out, a, b, queue=None, op=None): return elementwise.get_array_scalar_binop_kernel( out.context, op, out.dtype, a.dtype, np.array(b).dtype) @staticmethod @elwise_kernel_runner def _array_binop(out, a, b, queue=None, op=None): a_shape = a.shape b_shape = b.shape out_shape = out.shape assert (a_shape == b_shape == out_shape or (a_shape == () and b_shape == out_shape) or (b_shape == () and a_shape == out_shape)) return elementwise.get_array_binop_kernel( out.context, op, out.dtype, a.dtype, b.dtype, a_is_scalar=(a_shape == ()), b_is_scalar=(b_shape == ())) @staticmethod @elwise_kernel_runner def _unop(out, a, queue=None, op=None): if out.shape != a.shape: raise ValueError("shapes of arguments do not match") return elementwise.get_unop_kernel( out.context, op, a.dtype, out.dtype) # }}} # {{{ operators def mul_add(self, selffac, other, otherfac, queue=None): """Return ``selffac * self + otherfac * other``. """ queue = queue or self.queue if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, queue) result.add_event( self._axpbyz( result, selffac, self, otherfac, other, queue=queue)) return result elif np.isscalar(other): common_dtype = _get_common_dtype(self, other, queue) result = self._new_like_me(common_dtype, queue=queue) result.add_event( self._axpbz(result, selffac, self, common_dtype.type(otherfac * other), queue=queue)) return result else: raise NotImplementedError def __add__(self, other): """Add an array with an array or an array with a scalar.""" if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event( self._axpbyz(result, self.dtype.type(1), self, other.dtype.type(1), other)) return result elif np.isscalar(other): if other == 0: return self.copy() else: common_dtype = _get_common_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._axpbz(result, self.dtype.type(1), self, common_dtype.type(other))) return result else: return NotImplemented __radd__ = __add__ def __sub__(self, other): """Subtract an array from an array or a scalar from an array.""" if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event( self._axpbyz(result, self.dtype.type(1), self, result.dtype.type(-1), other)) return result elif np.isscalar(other): if other == 0: return self.copy() else: result = self._new_like_me( _get_common_dtype(self, other, self.queue)) result.add_event( self._axpbz(result, self.dtype.type(1), self, -other)) return result else: return NotImplemented def __rsub__(self, other): """Subtract an array from a scalar or an array:: x = n - self """ if np.isscalar(other): common_dtype = _get_common_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._axpbz(result, result.dtype.type(-1), self, common_dtype.type(other))) return result else: return NotImplemented def __iadd__(self, other): if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event( self._axpbyz(self, self.dtype.type(1), self, other.dtype.type(1), other)) return self elif np.isscalar(other): self.add_event( self._axpbz(self, self.dtype.type(1), self, other)) return self else: return NotImplemented def __isub__(self, other): if isinstance(other, Array): if other.shape != self.shape and other.shape != (): 
raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event( self._axpbyz(self, self.dtype.type(1), self, other.dtype.type(-1), other)) return self elif np.isscalar(other): self.add_event( self._axpbz(self, self.dtype.type(1), self, -other)) return self else: return NotImplemented def __pos__(self): return self def __neg__(self): result = self._new_like_me() result.add_event(self._axpbz(result, -1, self, 0)) return result def __mul__(self, other): if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event( self._elwise_multiply(result, self, other)) return result elif np.isscalar(other): common_dtype = _get_common_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._axpbz(result, common_dtype.type(other), self, self.dtype.type(0))) return result else: return NotImplemented def __rmul__(self, other): if np.isscalar(other): common_dtype = _get_common_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._axpbz(result, common_dtype.type(other), self, self.dtype.type(0))) return result else: return NotImplemented def __imul__(self, other): if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event( self._elwise_multiply(self, self, other)) return self elif np.isscalar(other): self.add_event( self._axpbz(self, other, self, self.dtype.type(0))) return self else: return NotImplemented def __div__(self, other): """Divides an array by an array or a scalar, i.e. ``self / other``. """ if isinstance(other, Array): result = _get_broadcasted_binary_op_result( self, other, self.queue, dtype_getter=_get_truedivide_dtype) result.add_event(self._div(result, self, other)) return result elif np.isscalar(other): if other == 1: return self.copy() else: common_dtype = _get_truedivide_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._axpbz(result, np.true_divide(common_dtype.type(1), other), self, self.dtype.type(0))) return result else: return NotImplemented __truediv__ = __div__ def __rdiv__(self, other): """Divides a scalar or an array by an array, i.e. ``other / self``. 
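        For example (an illustrative sketch; assumes *x* is a queue-backed
        :class:`Array` and *n* is a scalar)::

            y = n / x      # elementwise: y[i] = n / x[i]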
""" common_dtype = _get_truedivide_dtype(self, other, self.queue) if isinstance(other, Array): result = self._new_like_me(common_dtype) result.add_event(other._div(result, self)) return result elif np.isscalar(other): result = self._new_like_me(common_dtype) result.add_event( self._rdiv_scalar(result, self, common_dtype.type(other))) return result else: return NotImplemented __rtruediv__ = __rdiv__ def __itruediv__(self, other): # raise an error if the result cannot be cast to self common_dtype = _get_truedivide_dtype(self, other, self.queue) if not np.can_cast(common_dtype, self.dtype.type, "same_kind"): raise TypeError( "Cannot cast {!r} to {!r}".format(self.dtype, common_dtype)) if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event( self._div(self, self, other)) return self elif np.isscalar(other): if other == 1: return self else: self.add_event( self._axpbz(self, common_dtype.type(np.true_divide(1, other)), self, self.dtype.type(0))) return self else: return NotImplemented def __and__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError(f"Integral types only: {common_dtype}") if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event(self._array_binop(result, self, other, op="&")) return result elif np.isscalar(other): result = self._new_like_me(common_dtype) result.add_event( self._scalar_binop(result, self, other, op="&")) return result else: return NotImplemented __rand__ = __and__ # commutes def __or__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError("Integral types only") if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event(self._array_binop(result, self, other, op="|")) return result elif np.isscalar(other): result = self._new_like_me(common_dtype) result.add_event( self._scalar_binop(result, self, other, op="|")) return result else: return NotImplemented __ror__ = __or__ # commutes def __xor__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError(f"Integral types only: {common_dtype}") if isinstance(other, Array): result = _get_broadcasted_binary_op_result(self, other, self.queue) result.add_event(self._array_binop(result, self, other, op="^")) return result elif np.isscalar(other): result = self._new_like_me(common_dtype) result.add_event( self._scalar_binop(result, self, other, op="^")) return result else: return NotImplemented __rxor__ = __xor__ # commutes def __iand__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError(f"Integral types only: {common_dtype}") if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event(self._array_binop(self, self, other, op="&")) return self elif np.isscalar(other): self.add_event( self._scalar_binop(self, self, other, op="&")) return self else: return NotImplemented def __ior__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError(f"Integral types 
only: {common_dtype}") if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event(self._array_binop(self, self, other, op="|")) return self elif np.isscalar(other): self.add_event( self._scalar_binop(self, self, other, op="|")) return self else: return NotImplemented def __ixor__(self, other): common_dtype = _get_common_dtype(self, other, self.queue) if not np.issubdtype(common_dtype, np.integer): raise TypeError(f"Integral types only: {common_dtype}") if isinstance(other, Array): if other.shape != self.shape and other.shape != (): raise NotImplementedError("Broadcasting binary op with shapes:" f" {self.shape}, {other.shape}.") self.add_event(self._array_binop(self, self, other, op="^")) return self elif np.isscalar(other): self.add_event( self._scalar_binop(self, self, other, op="^")) return self else: return NotImplemented def _zero_fill(self, queue=None, wait_for=None): queue = queue or self.queue if not self.size: return cl_version_gtr_1_2 = ( queue._get_cl_version() >= (1, 2) and cl.get_cl_header_version() >= (1, 2) ) on_nvidia = queue.device.vendor.startswith("NVIDIA") # circumvent bug with large buffers on NVIDIA # https://github.com/inducer/pyopencl/issues/395 if cl_version_gtr_1_2 and not (on_nvidia and self.nbytes >= 2**31): self.add_event( cl.enqueue_fill(queue, self.base_data, np.int8(0), self.nbytes, offset=self.offset, wait_for=wait_for)) else: zero = np.zeros((), self.dtype) self.fill(zero, queue=queue) def fill(self, value, queue=None, wait_for=None): """Fill the array with *scalar*. :returns: *self*. """ self.add_event( self._fill(self, value, queue=queue, wait_for=wait_for)) return self def __len__(self): """Returns the size of the leading dimension of *self*.""" if len(self.shape): return self.shape[0] else: return TypeError("len() of unsized object") def __abs__(self): """Return an ``Array`` of the absolute values of the elements of *self*. """ result = self._new_like_me(self.dtype.type(0).real.dtype) result.add_event(self._abs(result, self)) return result def __pow__(self, other): """Exponentiation by a scalar or elementwise by another :class:`Array`. """ if isinstance(other, Array): assert self.shape == other.shape result = self._new_like_me( _get_common_dtype(self, other, self.queue)) result.add_event( self._pow_array(result, self, other)) return result elif np.isscalar(other): result = self._new_like_me( _get_common_dtype(self, other, self.queue)) result.add_event(self._pow_scalar(result, self, other)) return result else: return NotImplemented def __rpow__(self, other): if np.isscalar(other): common_dtype = _get_common_dtype(self, other, self.queue) result = self._new_like_me(common_dtype) result.add_event( self._rpow_scalar(result, common_dtype.type(other), self)) return result else: return NotImplemented def __invert__(self): if not np.issubdtype(self.dtype, np.integer): raise TypeError(f"Integral types only: {self.dtype}") result = self._new_like_me() result.add_event(self._unop(result, self, op="~")) return result # }}} def reverse(self, queue=None): """Return this array in reversed order. The array is treated as one-dimensional. 
""" result = self._new_like_me() result.add_event(self._reverse(result, self)) return result def astype(self, dtype, queue=None): """Return a copy of *self*, cast to *dtype*.""" if dtype == self.dtype: return self.copy() result = self._new_like_me(dtype=dtype) result.add_event(self._copy(result, self, queue=queue)) return result # {{{ rich comparisons, any, all def __bool__(self): if self.shape == (): return bool(self.get()) else: raise ValueError("The truth value of an array with " "more than one element is ambiguous. Use a.any() or a.all()") def any(self, queue=None, wait_for=None): from pyopencl.reduction import get_any_kernel krnl = get_any_kernel(self.context, self.dtype) if wait_for is None: wait_for = [] result, event1 = krnl(self, queue=queue, wait_for=wait_for + self.events, return_event=True) result.add_event(event1) return result def all(self, queue=None, wait_for=None): from pyopencl.reduction import get_all_kernel krnl = get_all_kernel(self.context, self.dtype) if wait_for is None: wait_for = [] result, event1 = krnl(self, queue=queue, wait_for=wait_for + self.events, return_event=True) result.add_event(event1) return result @staticmethod @elwise_kernel_runner def _scalar_comparison(out, a, b, queue=None, op=None): return elementwise.get_array_scalar_comparison_kernel( out.context, op, a.dtype) @staticmethod @elwise_kernel_runner def _array_comparison(out, a, b, queue=None, op=None): if a.shape != b.shape: raise ValueError("shapes of comparison arguments do not match") return elementwise.get_array_comparison_kernel( out.context, op, a.dtype, b.dtype) def __eq__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op="==")) return result elif np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._scalar_comparison(result, self, other, op="==")) return result else: return NotImplemented def __ne__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op="!=")) return result elif np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._scalar_comparison(result, self, other, op="!=")) return result else: return NotImplemented def __le__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op="<=")) return result elif np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) self._scalar_comparison(result, self, other, op="<=") return result else: return NotImplemented def __ge__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op=">=")) return result elif np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._scalar_comparison(result, self, other, op=">=")) return result else: return NotImplemented def __lt__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op="<")) return result elif np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._scalar_comparison(result, self, other, op="<")) return result else: return NotImplemented def __gt__(self, other): if isinstance(other, Array): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._array_comparison(result, self, other, op=">")) return result elif 
np.isscalar(other): result = self._new_like_me(_BOOL_DTYPE) result.add_event( self._scalar_comparison(result, self, other, op=">")) return result else: return NotImplemented # }}} # {{{ complex-valued business @property def real(self): """ .. versionadded:: 2012.1 """ if self.dtype.kind == "c": result = self._new_like_me(self.dtype.type(0).real.dtype) result.add_event( self._real(result, self)) return result else: return self @property def imag(self): """ .. versionadded:: 2012.1 """ if self.dtype.kind == "c": result = self._new_like_me(self.dtype.type(0).real.dtype) result.add_event( self._imag(result, self)) return result else: return zeros_like(self) def conj(self): """ .. versionadded:: 2012.1 """ if self.dtype.kind == "c": result = self._new_like_me() result.add_event(self._conj(result, self)) return result else: return self conjugate = conj # }}} # {{{ event management def add_event(self, evt): """Add *evt* to :attr:`events`. If :attr:`events` is too long, this method may implicitly wait for a subset of :attr:`events` and clear them from the list. """ n_wait = 4 self.events.append(evt) if len(self.events) > 3*n_wait: wait_events = self.events[:n_wait] cl.wait_for_events(wait_events) del self.events[:n_wait] def finish(self): """Wait for the entire contents of :attr:`events`, clear it.""" if self.events: cl.wait_for_events(self.events) del self.events[:] # }}} # {{{ views def reshape(self, *shape, **kwargs): """Returns an array containing the same data with a new shape.""" order = kwargs.pop("order", "C") if kwargs: raise TypeError("unexpected keyword arguments: %s" % list(kwargs.keys())) if order not in "CF": raise ValueError("order must be either 'C' or 'F'") # TODO: add more error-checking, perhaps # FIXME: The following is overly conservative. As long as we don't change # our memory footprint, we're good. # if not self.flags.forc: # raise RuntimeError("only contiguous arrays may " # "be used as arguments to this operation") if isinstance(shape[0], tuple) or isinstance(shape[0], list): shape = tuple(shape[0]) if -1 in shape: shape = list(shape) idx = shape.index(-1) size = -reduce(lambda x, y: x * y, shape, 1) if size == 0: shape[idx] = 0 else: shape[idx] = self.size // size if builtins.any(s < 0 for s in shape): raise ValueError("can only specify one unknown dimension") shape = tuple(shape) if shape == self.shape: return self._new_with_changes( data=self.base_data, offset=self.offset, shape=shape, strides=self.strides) import operator size = reduce(operator.mul, shape, 1) if size != self.size: raise ValueError("total size of new array must be unchanged") if self.size == 0: return self._new_with_changes( data=None, offset=0, shape=shape, strides=( _f_contiguous_strides(self.dtype.itemsize, shape) if order == "F" else _c_contiguous_strides(self.dtype.itemsize, shape) )) # {{{ determine reshaped strides # copied and translated from # https://github.com/numpy/numpy/blob/4083883228d61a3b571dec640185b5a5d983bf59/numpy/core/src/multiarray/shape.c # noqa: E501 newdims = shape newnd = len(newdims) # Remove axes with dimension 1 from the old array. They have no effect # but would need special cases since their strides do not matter. 
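        # For example (illustrative): reshaping (1, 6, 1) -> (3, 2) only needs
        # stride compatibility checks for the length-6 axis; the length-1 axes
        # carry no layout information and are dropped here first.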
olddims = [] oldstrides = [] for oi in range(len(self.shape)): s = self.shape[oi] if s != 1: olddims.append(s) oldstrides.append(self.strides[oi]) oldnd = len(olddims) newstrides = [-1]*len(newdims) # oi to oj and ni to nj give the axis ranges currently worked with oi = 0 oj = 1 ni = 0 nj = 1 while ni < newnd and oi < oldnd: np = newdims[ni] op = olddims[oi] while np != op: if np < op: # Misses trailing 1s, these are handled later np *= newdims[nj] nj += 1 else: op *= olddims[oj] oj += 1 # Check whether the original axes can be combined for ok in range(oi, oj-1): if order == "F": if oldstrides[ok+1] != olddims[ok]*oldstrides[ok]: raise ValueError("cannot reshape without copy") else: # C order if (oldstrides[ok] != olddims[ok+1]*oldstrides[ok+1]): raise ValueError("cannot reshape without copy") # Calculate new strides for all axes currently worked with if order == "F": newstrides[ni] = oldstrides[oi] for nk in range(ni+1, nj): newstrides[nk] = newstrides[nk - 1]*newdims[nk - 1] else: # C order newstrides[nj - 1] = oldstrides[oj - 1] for nk in range(nj-1, ni, -1): newstrides[nk - 1] = newstrides[nk]*newdims[nk] ni = nj nj += 1 oi = oj oj += 1 # Set strides corresponding to trailing 1s of the new shape. if ni >= 1: last_stride = newstrides[ni - 1] else: last_stride = self.dtype.itemsize if order == "F": last_stride *= newdims[ni - 1] for nk in range(ni, len(shape)): newstrides[nk] = last_stride # }}} return self._new_with_changes( data=self.base_data, offset=self.offset, shape=shape, strides=tuple(newstrides)) def ravel(self, order="C"): """Returns flattened array containing the same data.""" return self.reshape(self.size, order=order) def view(self, dtype=None): """Returns view of array with the same data. If *dtype* is different from current dtype, the actual bytes of memory will be reinterpreted. """ if dtype is None: dtype = self.dtype old_itemsize = self.dtype.itemsize itemsize = np.dtype(dtype).itemsize from pytools import argmin2 min_stride_axis = argmin2( (axis, abs(stride)) for axis, stride in enumerate(self.strides)) if self.shape[min_stride_axis] * old_itemsize % itemsize != 0: raise ValueError("new type not compatible with array") new_shape = ( self.shape[:min_stride_axis] + (self.shape[min_stride_axis] * old_itemsize // itemsize,) + self.shape[min_stride_axis+1:]) new_strides = ( self.strides[:min_stride_axis] + (self.strides[min_stride_axis] * itemsize // old_itemsize,) + self.strides[min_stride_axis+1:]) return self._new_with_changes( self.base_data, self.offset, shape=new_shape, dtype=dtype, strides=new_strides) def squeeze(self): """Returns a view of the array with dimensions of length 1 removed. .. versionadded:: 2015.2 """ new_shape = tuple(dim for dim in self.shape if dim > 1) new_strides = tuple( self.strides[i] for i, dim in enumerate(self.shape) if dim > 1) return self._new_with_changes( self.base_data, self.offset, shape=new_shape, strides=new_strides) def transpose(self, axes=None): """Permute the dimensions of an array. :arg axes: list of ints, optional. By default, reverse the dimensions, otherwise permute the axes according to the values given. :returns: :class:`Array` A view of the array with its axes permuted. .. 
versionadded:: 2015.2 """ if axes is None: axes = range(self.ndim-1, -1, -1) if len(axes) != len(self.shape): raise ValueError("axes don't match array") new_shape = [self.shape[axes[i]] for i in range(len(axes))] new_strides = [self.strides[axes[i]] for i in range(len(axes))] return self._new_with_changes( self.base_data, self.offset, shape=tuple(new_shape), strides=tuple(new_strides)) @property def T(self): # noqa: N802 """ .. versionadded:: 2015.2 """ return self.transpose() # }}} def map_to_host(self, queue=None, flags=None, is_blocking=True, wait_for=None): """If *is_blocking*, return a :class:`numpy.ndarray` corresponding to the same memory as *self*. If *is_blocking* is not true, return a tuple ``(ary, evt)``, where *ary* is the above-mentioned array. The host array is obtained using :func:`pyopencl.enqueue_map_buffer`. See there for further details. :arg flags: A combination of :class:`pyopencl.map_flags`. Defaults to read-write. .. versionadded :: 2013.2 """ if flags is None: flags = cl.map_flags.READ | cl.map_flags.WRITE if wait_for is None: wait_for = [] ary, evt = cl.enqueue_map_buffer( queue or self.queue, self.base_data, flags, self.offset, self.shape, self.dtype, strides=self.strides, wait_for=wait_for + self.events, is_blocking=is_blocking) if is_blocking: return ary else: return ary, evt # {{{ getitem/setitem def __getitem__(self, index): """ .. versionadded:: 2013.1 """ if isinstance(index, Array): if index.dtype.kind not in ("i", "u"): raise TypeError( "fancy indexing is only allowed with integers") if len(index.shape) != 1: raise NotImplementedError( "multidimensional fancy indexing is not supported") if len(self.shape) != 1: raise NotImplementedError( "fancy indexing into a multi-d array is not supported") return take(self, index) if not isinstance(index, tuple): index = (index,) new_shape = [] new_offset = self.offset new_strides = [] seen_ellipsis = False index_axis = 0 array_axis = 0 while index_axis < len(index): index_entry = index[index_axis] if array_axis > len(self.shape): raise IndexError("too many axes in index") if isinstance(index_entry, slice): start, stop, idx_stride = index_entry.indices( self.shape[array_axis]) array_stride = self.strides[array_axis] new_shape.append((abs(stop-start)-1)//abs(idx_stride)+1) new_strides.append(idx_stride*array_stride) new_offset += array_stride*start index_axis += 1 array_axis += 1 elif isinstance(index_entry, (int, np.integer)): array_shape = self.shape[array_axis] if index_entry < 0: index_entry += array_shape if not (0 <= index_entry < array_shape): raise IndexError( "subindex in axis %d out of range" % index_axis) new_offset += self.strides[array_axis]*index_entry index_axis += 1 array_axis += 1 elif index_entry is Ellipsis: index_axis += 1 remaining_index_count = len(index) - index_axis new_array_axis = len(self.shape) - remaining_index_count if new_array_axis < array_axis: raise IndexError("invalid use of ellipsis in index") while array_axis < new_array_axis: new_shape.append(self.shape[array_axis]) new_strides.append(self.strides[array_axis]) array_axis += 1 if seen_ellipsis: raise IndexError( "more than one ellipsis not allowed in index") seen_ellipsis = True elif index_entry is np.newaxis: new_shape.append(1) new_strides.append(0) index_axis += 1 else: raise IndexError("invalid subindex in axis %d" % index_axis) while array_axis < len(self.shape): new_shape.append(self.shape[array_axis]) new_strides.append(self.strides[array_axis]) array_axis += 1 return self._new_with_changes( self.base_data, offset=new_offset, 
shape=tuple(new_shape), strides=tuple(new_strides)) def setitem(self, subscript, value, queue=None, wait_for=None): """Like :meth:`__setitem__`, but with the ability to specify a *queue* and *wait_for*. .. versionadded:: 2013.1 .. versionchanged:: 2013.2 Added *wait_for*. """ queue = queue or self.queue or value.queue if wait_for is None: wait_for = [] wait_for = wait_for + self.events if isinstance(subscript, Array): if subscript.dtype.kind not in ("i", "u"): raise TypeError( "fancy indexing is only allowed with integers") if len(subscript.shape) != 1: raise NotImplementedError( "multidimensional fancy indexing is not supported") if len(self.shape) != 1: raise NotImplementedError( "fancy indexing into a multi-d array is not supported") multi_put([value], subscript, out=[self], queue=queue, wait_for=wait_for) return subarray = self[subscript] if not subarray.size: # This prevents errors about mismatched strides that neither we # nor numpy worry about in the empty case. return if isinstance(value, np.ndarray): if subarray.shape == value.shape and subarray.strides == value.strides: self.add_event( cl.enqueue_copy(queue, subarray.base_data, value, dst_offset=subarray.offset, wait_for=wait_for)) return else: value = to_device(queue, value, self.allocator) if isinstance(value, Array): if len(subarray.shape) != len(value.shape): raise NotImplementedError("broadcasting is not " "supported in __setitem__") if subarray.shape != value.shape: raise ValueError("cannot assign between arrays of " "differing shapes") if subarray.strides != value.strides: raise NotImplementedError("cannot assign between arrays of " "differing strides") self.add_event( self._copy(subarray, value, queue=queue, wait_for=wait_for)) else: # Let's assume it's a scalar subarray.fill(value, queue=queue, wait_for=wait_for) def __setitem__(self, subscript, value): """Set the slice of *self* identified by *subscript* to *value*. *value* is allowed to be: * An :class:`Array` of the same :attr:`shape` and (for now) :attr:`strides`, but with potentially different :attr:`dtype`. * A :class:`numpy.ndarray` of the same :attr:`shape` and (for now) :attr:`strides`, but with potentially different :attr:`dtype`. * A scalar. Non-scalar broadcasting is not currently supported. .. versionadded:: 2013.1 """ self.setitem(subscript, value) # }}} # }}} # {{{ creation helpers def as_strided(ary, shape=None, strides=None): """Make an :class:`Array` from the given array with the given shape and strides. """ # undocumented for the moment if shape is None: shape = ary.shape if strides is None: strides = ary.strides return Array(ary.queue, shape, ary.dtype, allocator=ary.allocator, data=ary.data, strides=strides) class _same_as_transfer: # noqa: N801 pass def to_device(queue, ary, allocator=None, async_=None, array_queue=_same_as_transfer, **kwargs): """Return an :class:`Array` that is an exact copy of the :class:`numpy.ndarray` instance *ary*. :arg array_queue: The :class:`~pyopencl.CommandQueue` which will be stored in the resulting array. Useful to make sure there is no implicit queue associated with the array by passing *None*. See :class:`Array` for the meaning of *allocator*. .. versionchanged:: 2015.2 *array_queue* argument was added. .. versionchanged:: 2017.2.1 Python 3.7 makes ``async`` a reserved keyword. On older Pythons, we will continue to accept *async* as a parameter, however this should be considered deprecated. *async_* is the new, official spelling. 
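    For example (a minimal sketch; assumes ``cl_array`` is
    :mod:`pyopencl.array` and *queue* is an existing
    :class:`~pyopencl.CommandQueue`)::

        a_host = np.arange(10, dtype=np.float32)
        a_dev = cl_array.to_device(queue, a_host)
        assert (a_dev.get() == a_host).all()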
""" # {{{ handle 'async' deprecation async_arg = kwargs.pop("async", None) if async_arg is not None: if async_ is not None: raise TypeError("may not specify both 'async' and 'async_'") async_ = async_arg if async_ is None: async_ = False if kwargs: raise TypeError("extra keyword arguments specified: %s" % ", ".join(kwargs)) # }}} if ary.dtype == object: raise RuntimeError("to_device does not work on object arrays.") if array_queue is _same_as_transfer: first_arg = queue else: first_arg = queue.context result = Array(first_arg, ary.shape, ary.dtype, allocator=allocator, strides=ary.strides) result.set(ary, async_=async_, queue=queue) return result empty = Array def zeros(queue, shape, dtype, order="C", allocator=None): """Same as :func:`empty`, but the :class:`Array` is zero-initialized before being returned. .. versionchanged:: 2011.1 *context* argument was deprecated. """ result = Array(None, shape, dtype, order=order, allocator=allocator, _context=queue.context, _queue=queue) result._zero_fill() return result def empty_like(ary, queue=_copy_queue, allocator=None): """Make a new, uninitialized :class:`Array` having the same properties as *other_ary*. """ return ary._new_with_changes(data=None, offset=0, queue=queue, allocator=allocator) def zeros_like(ary): """Make a new, zero-initialized :class:`Array` having the same properties as *other_ary*. """ result = ary._new_like_me() result._zero_fill() return result @dataclass class _ArangeInfo: start: Optional[int] = None stop: Optional[int] = None step: Optional[int] = None dtype: Optional["np.dtype"] = None allocator: Optional[Any] = None @elwise_kernel_runner def _arange_knl(result, start, step): return elementwise.get_arange_kernel( result.context, result.dtype) def arange(queue, *args, **kwargs): """arange(queue, [start, ] stop [, step], **kwargs) Create a :class:`Array` filled with numbers spaced *step* apart, starting from *start* and ending at *stop*. If not given, *start* defaults to 0, *step* defaults to 1. For floating point arguments, the length of the result is ``ceil((stop - start)/step)``. This rule may result in the last element of the result being greater than *stop*. *dtype* is a required keyword argument. .. versionchanged:: 2011.1 *context* argument was deprecated. .. versionchanged:: 2011.2 *allocator* keyword argument was added. """ # {{{ argument processing # Yuck. Thanks, numpy developers. 
;) explicit_dtype = False inf = _ArangeInfo() if isinstance(args[-1], np.dtype): inf.dtype = args[-1] args = args[:-1] explicit_dtype = True argc = len(args) if argc == 0: raise ValueError("stop argument required") elif argc == 1: inf.stop = args[0] elif argc == 2: inf.start = args[0] inf.stop = args[1] elif argc == 3: inf.start = args[0] inf.stop = args[1] inf.step = args[2] else: raise ValueError("too many arguments") admissible_names = ["start", "stop", "step", "dtype", "allocator"] for k, v in kwargs.items(): if k in admissible_names: if getattr(inf, k) is None: setattr(inf, k, v) if k == "dtype": explicit_dtype = True else: raise ValueError(f"may not specify '{k}' by position and keyword") else: raise ValueError(f"unexpected keyword argument '{k}'") if inf.start is None: inf.start = 0 if inf.step is None: inf.step = 1 if inf.dtype is None: inf.dtype = np.array([inf.start, inf.stop, inf.step]).dtype # }}} # {{{ actual functionality dtype = np.dtype(inf.dtype) start = dtype.type(inf.start) step = dtype.type(inf.step) stop = dtype.type(inf.stop) if not explicit_dtype: raise TypeError("arange requires a dtype argument") from math import ceil size = ceil((stop-start)/step) result = Array(queue, (size,), dtype, allocator=inf.allocator) result.add_event(_arange_knl(result, start, step, queue=queue)) # }}} return result # }}} # {{{ take/put/concatenate/diff/(h?stack) @elwise_kernel_runner def _take(result, ary, indices): return elementwise.get_take_kernel( result.context, result.dtype, indices.dtype) def take(a, indices, out=None, queue=None, wait_for=None): """Return the :class:`Array` ``[a[indices[0]], ..., a[indices[n]]]``. """ queue = queue or a.queue if out is None: out = type(a)(queue, indices.shape, a.dtype, allocator=a.allocator) assert len(indices.shape) == 1 out.add_event( _take(out, a, indices, queue=queue, wait_for=wait_for)) return out def multi_take(arrays, indices, out=None, queue=None): if not len(arrays): return [] assert len(indices.shape) == 1 from pytools import single_valued a_dtype = single_valued(a.dtype for a in arrays) a_allocator = arrays[0].allocator context = indices.context queue = queue or indices.queue vec_count = len(arrays) if out is None: out = [ type(arrays[i])( queue, indices.shape, a_dtype, allocator=a_allocator) for i in range(vec_count)] else: if len(out) != len(arrays): raise ValueError("out and arrays must have the same length") chunk_size = builtins.min(vec_count, 10) def make_func_for_chunk_size(chunk_size): knl = elementwise.get_take_kernel( indices.context, a_dtype, indices.dtype, vec_count=chunk_size) return knl knl = make_func_for_chunk_size(chunk_size) for start_i in range(0, len(arrays), chunk_size): chunk_slice = slice(start_i, start_i+chunk_size) if start_i + chunk_size > vec_count: knl = make_func_for_chunk_size(vec_count-start_i) gs, ls = indices._get_sizes(queue, knl.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, queue.device)) wait_for_this = ( *indices.events, *[evt for i in arrays[chunk_slice] for evt in i.events], *[evt for o in out[chunk_slice] for evt in o.events]) evt = knl(queue, gs, ls, indices.data, *[o.data for o in out[chunk_slice]], *[i.data for i in arrays[chunk_slice]], *[indices.size], wait_for=wait_for_this) for o in out[chunk_slice]: o.add_event(evt) return out def multi_take_put(arrays, dest_indices, src_indices, dest_shape=None, out=None, queue=None, src_offsets=None): if not len(arrays): return 
[] from pytools import single_valued a_dtype = single_valued(a.dtype for a in arrays) a_allocator = arrays[0].allocator context = src_indices.context queue = queue or src_indices.queue vec_count = len(arrays) if out is None: out = [type(arrays[i])(queue, dest_shape, a_dtype, allocator=a_allocator) for i in range(vec_count)] else: if a_dtype != single_valued(o.dtype for o in out): raise TypeError("arrays and out must have the same dtype") if len(out) != vec_count: raise ValueError("out and arrays must have the same length") if src_indices.dtype != dest_indices.dtype: raise TypeError( "src_indices and dest_indices must have the same dtype") if len(src_indices.shape) != 1: raise ValueError("src_indices must be 1D") if src_indices.shape != dest_indices.shape: raise ValueError( "src_indices and dest_indices must have the same shape") if src_offsets is None: src_offsets_list = [] else: src_offsets_list = src_offsets if len(src_offsets) != vec_count: raise ValueError( "arrays and src_offsets must have the same length") max_chunk_size = 10 chunk_size = builtins.min(vec_count, max_chunk_size) def make_func_for_chunk_size(chunk_size): return elementwise.get_take_put_kernel(context, a_dtype, src_indices.dtype, with_offsets=src_offsets is not None, vec_count=chunk_size) knl = make_func_for_chunk_size(chunk_size) for start_i in range(0, len(arrays), chunk_size): chunk_slice = slice(start_i, start_i+chunk_size) if start_i + chunk_size > vec_count: knl = make_func_for_chunk_size(vec_count-start_i) gs, ls = src_indices._get_sizes(queue, knl.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, queue.device)) wait_for_this = ( *dest_indices.events, *src_indices.events, *[evt for i in arrays[chunk_slice] for evt in i.events], *[evt for o in out[chunk_slice] for evt in o.events]) evt = knl(queue, gs, ls, *out[chunk_slice], dest_indices, src_indices, *arrays[chunk_slice], *src_offsets_list[chunk_slice], src_indices.size, wait_for=wait_for_this) for o in out[chunk_slice]: o.add_event(evt) return out def multi_put(arrays, dest_indices, dest_shape=None, out=None, queue=None, wait_for=None): if not len(arrays): return [] from pytools import single_valued a_dtype = single_valued(a.dtype for a in arrays) a_allocator = arrays[0].allocator context = dest_indices.context queue = queue or dest_indices.queue if wait_for is None: wait_for = [] wait_for = wait_for + dest_indices.events vec_count = len(arrays) if out is None: out = [type(arrays[i])(queue, dest_shape, a_dtype, allocator=a_allocator) for i in range(vec_count)] else: if a_dtype != single_valued(o.dtype for o in out): raise TypeError("arrays and out must have the same dtype") if len(out) != vec_count: raise ValueError("out and arrays must have the same length") if len(dest_indices.shape) != 1: raise ValueError("dest_indices must be 1D") chunk_size = builtins.min(vec_count, 10) # array of bools to specify whether the array of same index in this chunk # will be filled with a single value. 
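    # For example (illustrative): if a chunk holds source arrays of sizes
    # (1, n, 1), use_fill becomes [1, 0, 1]: the single-element arrays are
    # broadcast as fill values across their destination indices, while the
    # length-n array is scattered element by element.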
use_fill = np.ndarray((chunk_size,), dtype=np.uint8) array_lengths = np.ndarray((chunk_size,), dtype=np.int64) def make_func_for_chunk_size(chunk_size): knl = elementwise.get_put_kernel( context, a_dtype, dest_indices.dtype, vec_count=chunk_size) return knl knl = make_func_for_chunk_size(chunk_size) for start_i in range(0, len(arrays), chunk_size): chunk_slice = slice(start_i, start_i+chunk_size) for fill_idx, ary in enumerate(arrays[chunk_slice]): # If there is only one value in the values array for this src array # in the chunk then fill every index in `dest_idx` array with it. use_fill[fill_idx] = 1 if ary.size == 1 else 0 array_lengths[fill_idx] = len(ary) # Copy the populated `use_fill` array to a buffer on the device. use_fill_cla = to_device(queue, use_fill) array_lengths_cla = to_device(queue, array_lengths) if start_i + chunk_size > vec_count: knl = make_func_for_chunk_size(vec_count-start_i) gs, ls = dest_indices._get_sizes(queue, knl.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, queue.device)) wait_for_this = ( *wait_for, *[evt for i in arrays[chunk_slice] for evt in i.events], *[evt for o in out[chunk_slice] for evt in o.events]) evt = knl(queue, gs, ls, *out[chunk_slice], dest_indices, *arrays[chunk_slice], use_fill_cla, array_lengths_cla, dest_indices.size, wait_for=wait_for_this) for o in out[chunk_slice]: o.add_event(evt) return out def concatenate(arrays, axis=0, queue=None, allocator=None): """ .. versionadded:: 2013.1 .. note:: The returned array is of the same type as the first array in the list. """ if not arrays: raise ValueError("need at least one array to concatenate") # {{{ find properties of result array shape = None for i_ary, ary in enumerate(arrays): queue = queue or ary.queue allocator = allocator or ary.allocator if shape is None: # first array shape = list(ary.shape) else: if len(ary.shape) != len(shape): raise ValueError( f"{i_ary}-th array has different number of axes: " f"expected {len(shape)}, got {len(ary.shape)}") ary_shape_list = list(ary.shape) if (ary_shape_list[:axis] != shape[:axis] or ary_shape_list[axis+1:] != shape[axis+1:]): raise ValueError( f"{i_ary}-th array has dimensions outside axis {axis} " "that do not match the other arrays") # pylint: disable=unsupported-assignment-operation shape[axis] += ary.shape[axis] # }}} shape = tuple(shape) dtype = np.result_type(*[ary.dtype for ary in arrays]) if __debug__: if builtins.any(type(ary) != type(arrays[0]) # noqa: E721 for ary in arrays[1:]): warn("Elements of 'arrays' not of the same type, returning " "an instance of the type of arrays[0]", stacklevel=2) result = arrays[0].__class__(queue, shape, dtype, allocator=allocator) full_slice = (slice(None),) * len(shape) base_idx = 0 for ary in arrays: my_len = ary.shape[axis] result.setitem( full_slice[:axis] + (slice(base_idx, base_idx+my_len),) + full_slice[axis+1:], ary) base_idx += my_len return result @elwise_kernel_runner def _diff(result, array): return elementwise.get_diff_kernel(array.context, array.dtype) def diff(array, queue=None, allocator=None): """ .. 
versionadded:: 2013.2 """ if len(array.shape) != 1: raise ValueError("multi-D arrays are not supported") n, = array.shape queue = queue or array.queue allocator = allocator or array.allocator result = array.__class__(queue, (n-1,), array.dtype, allocator=allocator) event1 = _diff(result, array, queue=queue) result.add_event(event1) return result def hstack(arrays, queue=None): if len(arrays) == 0: raise ValueError("need at least one array to hstack") if queue is None: for ary in arrays: if ary.queue is not None: queue = ary.queue break from pytools import all_equal, single_valued if not all_equal(len(ary.shape) for ary in arrays): raise ValueError("arguments must all have the same number of axes") lead_shape = single_valued(ary.shape[:-1] for ary in arrays) w = builtins.sum(ary.shape[-1] for ary in arrays) if __debug__: if builtins.any(type(ary) != type(arrays[0]) # noqa: E721 for ary in arrays[1:]): warn("Elements of 'arrays' not of the same type, returning " "an instance of the type of arrays[0]", stacklevel=2) result = arrays[0].__class__(queue, (*lead_shape, w), arrays[0].dtype, allocator=arrays[0].allocator) index = 0 for ary in arrays: result[..., index:index+ary.shape[-1]] = ary index += ary.shape[-1] return result def stack(arrays, axis=0, queue=None): """ Join a sequence of arrays along a new axis. :arg arrays: A sequence of :class:`Array`. :arg axis: Index of the dimension of the new axis in the result array. Can be -1, for the new axis to be last dimension. :returns: :class:`Array` """ if not arrays: raise ValueError("need at least one array to stack") input_shape = arrays[0].shape input_ndim = arrays[0].ndim axis = input_ndim if axis == -1 else axis if queue is None: for ary in arrays: if ary.queue is not None: queue = ary.queue break if not builtins.all(ary.shape == input_shape for ary in arrays[1:]): raise ValueError("arrays must have the same shape") if not (0 <= axis <= input_ndim): raise ValueError("invalid axis") if (axis == 0 and not builtins.all( ary.flags.c_contiguous for ary in arrays)): # pyopencl.Array.__setitem__ does not support non-contiguous assignments raise NotImplementedError if (axis == input_ndim and not builtins.all( ary.flags.f_contiguous for ary in arrays)): # pyopencl.Array.__setitem__ does not support non-contiguous assignments raise NotImplementedError result_shape = input_shape[:axis] + (len(arrays),) + input_shape[axis:] if __debug__: if builtins.any(type(ary) != type(arrays[0]) # noqa: E721 for ary in arrays[1:]): warn("Elements of 'arrays' not of the same type, returning " "an instance of the type of arrays[0]", stacklevel=2) result = arrays[0].__class__(queue, result_shape, np.result_type(*(ary.dtype for ary in arrays)), # TODO: reconsider once arrays support # non-contiguous assignments order="C" if axis == 0 else "F", allocator=arrays[0].allocator) for i, ary in enumerate(arrays): idx = (slice(None),)*axis + (i,) + (slice(None),)*(input_ndim-axis) result[idx] = ary return result # }}} # {{{ shape manipulation def transpose(a, axes=None): """Permute the dimensions of an array. :arg a: :class:`Array` :arg axes: list of ints, optional. By default, reverse the dimensions, otherwise permute the axes according to the values given. :returns: :class:`Array` A view of the array with its axes permuted. """ return a.transpose(axes) def reshape(a, shape): """Gives a new shape to an array without changing its data. .. 
versionadded:: 2015.2 """ return a.reshape(shape) # }}} # {{{ conditionals @elwise_kernel_runner def _if_positive(result, criterion, then_, else_): return elementwise.get_if_positive_kernel( result.context, criterion.dtype, then_.dtype, is_then_array=isinstance(then_, Array), is_else_array=isinstance(else_, Array), is_then_scalar=then_.shape == (), is_else_scalar=else_.shape == (), ) def if_positive(criterion, then_, else_, out=None, queue=None): """Return an array like *then_*, which, for the element at index *i*, contains *then_[i]* if *criterion[i]>0*, else *else_[i]*. """ is_then_scalar = isinstance(then_, SCALAR_CLASSES) is_else_scalar = isinstance(else_, SCALAR_CLASSES) if isinstance(criterion, SCALAR_CLASSES) and is_then_scalar and is_else_scalar: result = np.where(criterion, then_, else_) if out is not None: out[...] = result return out return result if is_then_scalar: then_ = np.array(then_) if is_else_scalar: else_ = np.array(else_) if then_.dtype != else_.dtype: raise ValueError( f"dtypes do not match: then_ is '{then_.dtype}' and " f"else_ is '{else_.dtype}'") if then_.shape == () and else_.shape == (): pass elif then_.shape != () and else_.shape != (): if not (criterion.shape == then_.shape == else_.shape): raise ValueError( f"shapes do not match: 'criterion' has shape {criterion.shape}" f", 'then_' has shape {then_.shape} and 'else_' has shape " f"{else_.shape}") elif then_.shape == (): if criterion.shape != else_.shape: raise ValueError( f"shapes do not match: 'criterion' has shape {criterion.shape}" f" and 'else_' has shape {else_.shape}") elif else_.shape == (): if criterion.shape != then_.shape: raise ValueError( f"shapes do not match: 'criterion' has shape {criterion.shape}" f" and 'then_' has shape {then_.shape}") else: raise AssertionError() if out is None: if then_.shape != (): out = empty_like( then_, criterion.queue, allocator=criterion.allocator) else: # Use same strides as criterion cr_byte_strides = np.array(criterion.strides, dtype=np.int64) cr_item_strides = cr_byte_strides // criterion.dtype.itemsize out_strides = tuple(cr_item_strides*then_.dtype.itemsize) out = type(criterion)( criterion.queue, criterion.shape, then_.dtype, allocator=criterion.allocator, strides=out_strides) event1 = _if_positive(out, criterion, then_, else_, queue=queue) out.add_event(event1) return out # }}} # {{{ minimum/maximum @elwise_kernel_runner def _minimum_maximum_backend(out, a, b, minmax): from pyopencl.elementwise import get_minmaximum_kernel return get_minmaximum_kernel(out.context, minmax, out.dtype, a.dtype if isinstance(a, Array) else np.dtype(type(a)), b.dtype if isinstance(b, Array) else np.dtype(type(b)), elementwise.get_argument_kind(a), elementwise.get_argument_kind(b)) def maximum(a, b, out=None, queue=None): """Return the elementwise maximum of *a* and *b*.""" a_is_scalar = np.isscalar(a) b_is_scalar = np.isscalar(b) if a_is_scalar and b_is_scalar: result = np.maximum(a, b) if out is not None: out[...] = result return out return result queue = queue or a.queue or b.queue if out is None: out_dtype = _get_common_dtype(a, b, queue) if not a_is_scalar: out = a._new_like_me(out_dtype, queue) elif not b_is_scalar: out = b._new_like_me(out_dtype, queue) out.add_event(_minimum_maximum_backend(out, a, b, queue=queue, minmax="max")) return out def minimum(a, b, out=None, queue=None): """Return the elementwise minimum of *a* and *b*.""" a_is_scalar = np.isscalar(a) b_is_scalar = np.isscalar(b) if a_is_scalar and b_is_scalar: result = np.minimum(a, b) if out is not None: out[...] 
= result return out return result queue = queue or a.queue or b.queue if out is None: out_dtype = _get_common_dtype(a, b, queue) if not a_is_scalar: out = a._new_like_me(out_dtype, queue) elif not b_is_scalar: out = b._new_like_me(out_dtype, queue) out.add_event(_minimum_maximum_backend(out, a, b, queue=queue, minmax="min")) return out # }}} # {{{ logical ops def _logical_op(x1, x2, out, operator, queue=None): # NOTE: Copied from pycuda.gpuarray assert operator in ["&&", "||"] if np.isscalar(x1) and np.isscalar(x2): if out is None: out = empty(queue, shape=(), dtype=np.int8) if operator == "&&": out[:] = np.logical_and(x1, x2) else: out[:] = np.logical_or(x1, x2) elif np.isscalar(x1) or np.isscalar(x2): scalar_arg, = (x for x in (x1, x2) if np.isscalar(x)) ary_arg, = (x for x in (x1, x2) if not np.isscalar(x)) queue = queue or ary_arg.queue allocator = ary_arg.allocator if not isinstance(ary_arg, Array): raise ValueError("logical_and can take either scalar or Array" " as inputs") out = out or ary_arg._new_like_me(dtype=np.int8) assert out.shape == ary_arg.shape and out.dtype == np.int8 knl = elementwise.get_array_scalar_binop_kernel( queue.context, operator, out.dtype, ary_arg.dtype, np.dtype(type(scalar_arg)) ) elwise_kernel_runner(lambda *args, **kwargs: knl)(out, ary_arg, scalar_arg) else: if not (isinstance(x1, Array) and isinstance(x2, Array)): raise ValueError("logical_or/logical_and can take either scalar" " or Arrays as inputs") if x1.shape != x2.shape: raise NotImplementedError("Broadcasting not supported") queue = queue or x1.queue or x2.queue allocator = x1.allocator or x2.allocator if out is None: out = empty(queue, allocator=allocator, shape=x1.shape, dtype=np.int8) assert out.shape == x1.shape and out.dtype == np.int8 knl = elementwise.get_array_binop_kernel( queue.context, operator, out.dtype, x1.dtype, x2.dtype) elwise_kernel_runner(lambda *args, **kwargs: knl)(out, x1, x2) return out def logical_and(x1, x2, /, out=None, queue=None): """ Returns the element-wise logical AND of *x1* and *x2*. """ return _logical_op(x1, x2, out, "&&", queue=queue) def logical_or(x1, x2, /, out=None, queue=None): """ Returns the element-wise logical OR of *x1* and *x2*. """ return _logical_op(x1, x2, out, "||", queue=queue) def logical_not(x, /, out=None, queue=None): """ Returns the element-wise logical NOT of *x*. """ if np.isscalar(x): out = out or empty(queue, shape=(), dtype=np.int8) out[:] = np.logical_not(x) else: queue = queue or x.queue out = out or empty(queue, shape=x.shape, dtype=np.int8, allocator=x.allocator) knl = elementwise.get_logical_not_kernel(queue.context, x.dtype) elwise_kernel_runner(lambda *args, **kwargs: knl)(out, x) return out # }}} # {{{ reductions def sum(a, dtype=None, queue=None, slice=None, initial=np._NoValue): """ .. 
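note:: Illustrative sketch of the conditional and logical helpers above
    (hypothetical data, reusing ``queue`` from the earlier sketches; not
    from the original source)::

        p = cl_array.to_device(queue, np.array([1, 0, 2], dtype=np.int32))
        q = cl_array.to_device(queue, np.array([0, 0, 3], dtype=np.int32))

        cl_array.logical_and(p, q)     # int8 array: [0, 0, 1]
        cl_array.logical_or(p, q)      # int8 array: [1, 0, 1]
        cl_array.logical_not(p)        # int8 array: [0, 1, 0]
        cl_array.maximum(p, q)         # elementwise max: [1, 0, 3]
        cl_array.if_positive(p, p, q)  # p[i] if p[i] > 0 else q[i]: [1, 0, 2]

..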
versionadded:: 2011.1 """ if initial is not np._NoValue and not isinstance(initial, SCALAR_CLASSES): raise ValueError("'initial' is not a scalar") if dtype is not None: dtype = np.dtype(dtype) from pyopencl.reduction import get_sum_kernel krnl = get_sum_kernel(a.context, dtype, a.dtype) result, event1 = krnl(a, queue=queue, slice=slice, wait_for=a.events, return_event=True) result.add_event(event1) # NOTE: neutral element in `get_sum_kernel` is 0 by default if initial is not np._NoValue: result += a.dtype.type(initial) return result def any(a, queue=None, wait_for=None): if len(a) == 0: return _BOOL_DTYPE.type(False) return a.any(queue=queue, wait_for=wait_for) def all(a, queue=None, wait_for=None): if len(a) == 0: return _BOOL_DTYPE.type(True) return a.all(queue=queue, wait_for=wait_for) def dot(a, b, dtype=None, queue=None, slice=None): """ .. versionadded:: 2011.1 """ if dtype is not None: dtype = np.dtype(dtype) from pyopencl.reduction import get_dot_kernel krnl = get_dot_kernel(a.context, dtype, a.dtype, b.dtype) result, event1 = krnl(a, b, queue=queue, slice=slice, wait_for=a.events + b.events, return_event=True) result.add_event(event1) return result def vdot(a, b, dtype=None, queue=None, slice=None): """Like :func:`numpy.vdot`. .. versionadded:: 2013.1 """ if dtype is not None: dtype = np.dtype(dtype) from pyopencl.reduction import get_dot_kernel krnl = get_dot_kernel(a.context, dtype, a.dtype, b.dtype, conjugate_first=True) result, event1 = krnl(a, b, queue=queue, slice=slice, wait_for=a.events + b.events, return_event=True) result.add_event(event1) return result def subset_dot(subset, a, b, dtype=None, queue=None, slice=None): """ .. versionadded:: 2011.1 """ if dtype is not None: dtype = np.dtype(dtype) from pyopencl.reduction import get_subset_dot_kernel krnl = get_subset_dot_kernel( a.context, dtype, subset.dtype, a.dtype, b.dtype) result, event1 = krnl(subset, a, b, queue=queue, slice=slice, wait_for=subset.events + a.events + b.events, return_event=True) result.add_event(event1) return result def _make_minmax_kernel(what): def f(a, queue=None, initial=np._NoValue): if isinstance(a, SCALAR_CLASSES): return np.array(a).dtype.type(a) if len(a) == 0: if initial is np._NoValue: raise ValueError( f"zero-size array to reduction '{what}' " "which has no identity") else: return initial if initial is not np._NoValue and not isinstance(initial, SCALAR_CLASSES): raise ValueError("'initial' is not a scalar") from pyopencl.reduction import get_minmax_kernel krnl = get_minmax_kernel(a.context, what, a.dtype) result, event1 = krnl(a, queue=queue, wait_for=a.events, return_event=True) result.add_event(event1) if initial is not np._NoValue: initial = a.dtype.type(initial) if what == "min": result = minimum(result, initial, queue=queue) elif what == "max": result = maximum(result, initial, queue=queue) else: raise ValueError(f"unknown minmax reduction type: '{what}'") return result return f min = _make_minmax_kernel("min") min.__name__ = "min" min.__doc__ = """ .. versionadded:: 2011.1 """ max = _make_minmax_kernel("max") max.__name__ = "max" max.__doc__ = """ .. 
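note:: Illustrative sketch of the reductions above (hypothetical data,
    reusing ``queue``; not from the original source). Each reduction
    returns a device-side scalar; ``.get()`` pulls it to the host::

        v = cl_array.to_device(queue, np.arange(5, dtype=np.float32))

        cl_array.sum(v).get()             # 10.0
        cl_array.sum(v, initial=2).get()  # 12.0; 'initial' is added on
        cl_array.dot(v, v).get()          # 30.0
        cl_array.max(v, initial=7).get()  # 7.0, folded in via maximum()

..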
versionadded:: 2011.1 """ def _make_subset_minmax_kernel(what): def f(subset, a, queue=None, slice=None): from pyopencl.reduction import get_subset_minmax_kernel krnl = get_subset_minmax_kernel(a.context, what, a.dtype, subset.dtype) result, event1 = krnl(subset, a, queue=queue, slice=slice, wait_for=a.events + subset.events, return_event=True) result.add_event(event1) return result return f subset_min = _make_subset_minmax_kernel("min") subset_min.__doc__ = """.. versionadded:: 2011.1""" subset_max = _make_subset_minmax_kernel("max") subset_max.__doc__ = """.. versionadded:: 2011.1""" # }}} # {{{ scans def cumsum(a, output_dtype=None, queue=None, wait_for=None, return_event=False): # undocumented for now """ .. versionadded:: 2013.1 """ if output_dtype is None: output_dtype = a.dtype else: output_dtype = np.dtype(output_dtype) if wait_for is None: wait_for = [] result = a._new_like_me(output_dtype) from pyopencl.scan import get_cumsum_kernel krnl = get_cumsum_kernel(a.context, a.dtype, output_dtype) evt = krnl(a, result, queue=queue, wait_for=wait_for + a.events) result.add_event(evt) if return_event: return evt, result else: return result # }}} # vim: foldmethod=marker pyopencl-2025.1/pyopencl/bitonic_sort.py0000644000000000000000000001752514332717401015277 0ustar00__copyright__ = """ Copyright (c) 2011, Eric Bainville Copyright (c) 2015, Ilya Efimoff All rights reserved. """ # based on code at http://www.bealto.com/gpu-sorting_intro.html __license__ = """ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. """ from functools import reduce from operator import mul from typing import ClassVar, Dict from mako.template import Template from pytools import memoize_method import pyopencl as cl import pyopencl.bitonic_sort_templates as _tmpl from pyopencl.tools import dtype_to_ctype def _is_power_of_2(n): from pyopencl.tools import bitlog2 return n == 0 or 2**bitlog2(n) == n class BitonicSort: """Sort an array (or one axis of one) using a sorting network. Will only work if the axis of the array to be sorted has a length that is a power of 2. .. versionadded:: 2015.2 .. seealso:: :class:`pyopencl.algorithm.RadixSort` .. 
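note:: Illustrative usage sketch (hypothetical context/queue and data;
    not from the original source). Per the class docstring, the sorted
    axis must have a power-of-2 length::

        import numpy as np
        import pyopencl as cl
        import pyopencl.array as cl_array
        from pyopencl.bitonic_sort import BitonicSort

        ctx = cl.create_some_context()
        queue = cl.CommandQueue(ctx)

        sorter = BitonicSort(ctx)
        a = cl_array.to_device(
            queue, np.random.randint(0, 100, 16).astype(np.int32))
        a, evt = sorter(a, queue=queue)  # sorts in place, returns event

..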
automethod:: __call__ """ kernels_srcs: ClassVar[Dict[str, str]] = { "B2": _tmpl.ParallelBitonic_B2, "B4": _tmpl.ParallelBitonic_B4, "B8": _tmpl.ParallelBitonic_B8, "B16": _tmpl.ParallelBitonic_B16, "C4": _tmpl.ParallelBitonic_C4, "BL": _tmpl.ParallelBitonic_Local, "BLO": _tmpl.ParallelBitonic_Local_Optim, "PML": _tmpl.ParallelMerge_Local } def __init__(self, context): self.context = context def __call__(self, arr, idx=None, queue=None, wait_for=None, axis=0): """ :arg arr: the array to be sorted. Will be overwritten with the sorted array. :arg idx: an array of indices to be tracked along with the sorting of *arr* :arg queue: a :class:`pyopencl.CommandQueue`, defaults to the array's queue if None :arg wait_for: a list of :class:`pyopencl.Event` instances or None :arg axis: the axis of the array by which to sort :returns: a tuple (sorted_array, event) """ if queue is None: queue = arr.queue if wait_for is None: wait_for = [] wait_for = wait_for + arr.events last_evt = cl.enqueue_marker(queue, wait_for=wait_for) if arr.shape[axis] == 0: return arr, last_evt if not _is_power_of_2(arr.shape[axis]): raise ValueError("sorted array axis length must be a power of 2") if idx is None: argsort = 0 else: argsort = 1 run_queue = self.sort_b_prepare_wl( argsort, arr.dtype, idx.dtype if idx is not None else None, arr.shape, axis) knl, nt, wg, aux = run_queue[0] if idx is not None: if aux: last_evt = knl( queue, (nt,), wg, arr.data, idx.data, cl.LocalMemory( _tmpl.LOCAL_MEM_FACTOR*wg[0]*arr.dtype.itemsize), cl.LocalMemory( _tmpl.LOCAL_MEM_FACTOR*wg[0]*idx.dtype.itemsize), wait_for=[last_evt]) for knl, nt, wg, _ in run_queue[1:]: last_evt = knl( queue, (nt,), wg, arr.data, idx.data, wait_for=[last_evt]) else: if aux: last_evt = knl( queue, (nt,), wg, arr.data, cl.LocalMemory( _tmpl.LOCAL_MEM_FACTOR*wg[0]*4*arr.dtype.itemsize), wait_for=[last_evt]) for knl, nt, wg, _ in run_queue[1:]: last_evt = knl(queue, (nt,), wg, arr.data, wait_for=[last_evt]) return arr, last_evt @memoize_method def get_program(self, letter, argsort, params): defstpl = Template(_tmpl.defines) defs = defstpl.render( NS="\\", argsort=argsort, inc=params[0], dir=params[1], dtype=params[2], idxtype=params[3], dsize=params[4], nsize=params[5]) kid = Template(self.kernels_srcs[letter]).render(argsort=argsort) prg = cl.Program(self.context, defs + kid).build() return prg @memoize_method def sort_b_prepare_wl(self, argsort, key_dtype, idx_dtype, shape, axis): key_ctype = dtype_to_ctype(key_dtype) if idx_dtype is None: idx_ctype = "uint" # Dummy else: idx_ctype = dtype_to_ctype(idx_dtype) run_queue = [] ds = int(shape[axis]) size = reduce(mul, shape) ndim = len(shape) ns = reduce(mul, shape[(axis+1):]) if axis < ndim-1 else 1 ds = int(shape[axis]) allowb4 = True allowb8 = True allowb16 = True dev = self.context.devices[0] # {{{ find workgroup size wg = min(ds, dev.max_work_group_size) available_lmem = dev.local_mem_size while True: lmem_size = _tmpl.LOCAL_MEM_FACTOR*wg*key_dtype.itemsize if argsort: lmem_size += _tmpl.LOCAL_MEM_FACTOR*wg*idx_dtype.itemsize if lmem_size + 512 > available_lmem: wg //= 2 if not wg: raise RuntimeError( "too little local memory available on '%s'" % dev) else: break # }}} length = wg >> 1 prg = self.get_program( "BLO", argsort, (1, 1, key_ctype, idx_ctype, ds, ns)) run_queue.append((prg.run, size, (wg,), True)) while length < ds: inc = length while inc > 0: ninc = 0 direction = length << 1 if allowb16 and inc >= 8 and ninc == 0: letter = "B16" ninc = 4 elif allowb8 and inc >= 4 and ninc == 0: letter = "B8" ninc = 3 
elif allowb4 and inc >= 2 and ninc == 0: letter = "B4" ninc = 2 elif inc >= 0: letter = "B2" ninc = 1 else: raise AssertionError("Should not happen") nthreads = size >> ninc prg = self.get_program(letter, argsort, (inc, direction, key_ctype, idx_ctype, ds, ns)) run_queue.append((prg.run, nthreads, None, False,)) inc >>= ninc length <<= 1 return run_queue pyopencl-2025.1/pyopencl/bitonic_sort_templates.py0000644000000000000000000003742714332717401017360 0ustar00__copyright__ = """ Copyright (c) 2011, Eric Bainville Copyright (c) 2015, Ilya Efimoff All rights reserved. """ __license__ = """ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
""" LOCAL_MEM_FACTOR = 1 # {{{ defines defines = """//CL// % if dtype == "double": #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif % endif typedef ${dtype} data_t; typedef ${idxtype} idx_t; #if CONFIG_USE_VALUE #define getKey(a) ((a).x) #define getValue(a) ((a).y) #define makeData(k,v) ((${dtype}2)((k),(v))) #else #define getKey(a) (a) #define getValue(a) (0) #define makeData(k,v) (k) #endif #ifndef BLOCK_FACTOR #define BLOCK_FACTOR 1 #endif #define inc ${inc} #define hinc ${inc>>1} //Half inc #define qinc ${inc>>2} //Quarter inc #define einc ${inc>>3} //Eighth of inc #define dir ${dir} % if argsort: #define ORDER(a,b,ay,by) { bool swap = reverse ^ (getKey(a)>1); // thread index int gt = get_global_id(0) / (dsize>>1); int low = t & (inc - 1); // low order bits (below INC) int i = (t<<1) - low; // insert 0 at position INC int gi = i/dsize; // block index bool reverse = ((dir & i) == 0);// ^ (gi%2); // asc/desc order int offset = (gt/nsize)*nsize*dsize+(gt%nsize); data += i*nsize + offset; // translate to first value % if argsort: index += i*nsize + offset; // translate to first value % endif // Load data data_t x0 = data[ 0]; data_t x1 = data[inc*nsize]; % if argsort: // Load index idx_t i0 = index[ 0]; idx_t i1 = index[inc*nsize]; % endif // Sort % if argsort: ORDER(x0,x1,i0,i1) % else: ORDER(x0,x1) % endif // Store data data[0 ] = x0; data[inc*nsize] = x1; % if argsort: // Store index index[ 0] = i0; index[inc*nsize] = i1; % endif } """ # }}} # {{{ B4 ParallelBitonic_B4 = """//CL// // N/4 threads //ParallelBitonic_B4 __kernel void run(__global data_t * data\\ % if argsort: , __global idx_t * index) % else: ) % endif { int t = get_global_id(0) % (dsize>>2); // thread index int gt = get_global_id(0) / (dsize>>2); int low = t & (hinc - 1); // low order bits (below INC) int i = ((t - low) << 2) + low; // insert 00 at position INC bool reverse = ((dir & i) == 0); // asc/desc order int offset = (gt/nsize)*nsize*dsize+(gt%nsize); data += i*nsize + offset; // translate to first value % if argsort: index += i*nsize + offset; // translate to first value % endif // Load data data_t x0 = data[ 0]; data_t x1 = data[ hinc*nsize]; data_t x2 = data[2*hinc*nsize]; data_t x3 = data[3*hinc*nsize]; % if argsort: // Load index idx_t i0 = index[ 0]; idx_t i1 = index[ hinc*nsize]; idx_t i2 = index[2*hinc*nsize]; idx_t i3 = index[3*hinc*nsize]; % endif // Sort % if argsort: ORDER(x0,x2,i0,i2) ORDER(x1,x3,i1,i3) ORDER(x0,x1,i0,i1) ORDER(x2,x3,i2,i3) % else: ORDER(x0,x2) ORDER(x1,x3) ORDER(x0,x1) ORDER(x2,x3) % endif // Store data data[ 0] = x0; data[ hinc*nsize] = x1; data[2*hinc*nsize] = x2; data[3*hinc*nsize] = x3; % if argsort: // Store index index[ 0] = i0; index[ hinc*nsize] = i1; index[2*hinc*nsize] = i2; index[3*hinc*nsize] = i3; % endif } """ # }}} # {{{ B8 ParallelBitonic_B8 = """//CL// // N/8 threads //ParallelBitonic_B8 __kernel void run(__global data_t * data\\ % if argsort: , __global idx_t * index) % else: ) % endif { int t = get_global_id(0) % (dsize>>3); // thread index int gt = get_global_id(0) / (dsize>>3); int low = t & (qinc - 1); // low order bits (below INC) int i = ((t - low) << 3) + low; // insert 000 at position INC bool reverse = ((dir & i) == 0); // asc/desc order int offset = (gt/nsize)*nsize*dsize+(gt%nsize); data += i*nsize + offset; // translate to first value % if argsort: index += i*nsize + offset; // translate to first value % endif // Load data_t x[8]; % if argsort: idx_t y[8]; % endif for (int k=0;k<8;k++) x[k] = data[k*qinc*nsize]; % if 
argsort: for (int k=0;k<8;k++) y[k] = index[k*qinc*nsize]; % endif // Sort % if argsort: B8V(x,y,0) % else: B8V(x,0) % endif // Store for (int k=0;k<8;k++) data[k*qinc*nsize] = x[k]; % if argsort: for (int k=0;k<8;k++) index[k*qinc*nsize] = y[k]; % endif } """ # }}} # {{{ B16 ParallelBitonic_B16 = """//CL// // N/16 threads //ParallelBitonic_B16 __kernel void run(__global data_t * data\\ % if argsort: , __global idx_t * index) % else: ) % endif { int t = get_global_id(0) % (dsize>>4); // thread index int gt = get_global_id(0) / (dsize>>4); int low = t & (einc - 1); // low order bits (below INC) int i = ((t - low) << 4) + low; // insert 0000 at position INC bool reverse = ((dir & i) == 0); // asc/desc order int offset = (gt/nsize)*nsize*dsize+(gt%nsize); data += i*nsize + offset; // translate to first value % if argsort: index += i*nsize + offset; // translate to first value % endif // Load data_t x[16]; % if argsort: idx_t y[16]; % endif for (int k=0;k<16;k++) x[k] = data[k*einc*nsize]; % if argsort: for (int k=0;k<16;k++) y[k] = index[k*einc*nsize]; % endif // Sort % if argsort: B16V(x,y,0) % else: B16V(x,0) % endif // Store for (int k=0;k<16;k++) data[k*einc*nsize] = x[k]; % if argsort: for (int k=0;k<16;k++) index[k*einc*nsize] = y[k]; % endif } """ # }}} # {{{ C4 # IF YOU RE-ENABLE THIS, YOU NEED TO ADJUST LOCAL_MEM_FACTOR TO 4 ParallelBitonic_C4 = """//CL// //ParallelBitonic_C4 __kernel void run\\ % if argsort: (__global data_t * data, __global idx_t * index, __local data_t * aux, __local idx_t * auy) % else: (__global data_t * data, __local data_t * aux) % endif { int t = get_global_id(0); // thread index int wgBits = 4*get_local_size(0) - 1; // bit mask to get index in local memory AUX (size is 4*WG) int linc,low,i; bool reverse; data_t x[4]; % if argsort: idx_t y[4]; % endif // First iteration, global input, local output linc = hinc; low = t & (linc - 1); // low order bits (below INC) i = ((t - low) << 2) + low; // insert 00 at position INC reverse = ((dir & i) == 0); // asc/desc order for (int k=0;k<4;k++) x[k] = data[i+k*linc]; % if argsort: for (int k=0;k<4;k++) y[k] = index[i+k*linc]; B4V(x,y,0); for (int k=0;k<4;k++) auy[(i+k*linc) & wgBits] = y[k]; % else: B4V(x,0); % endif for (int k=0;k<4;k++) aux[(i+k*linc) & wgBits] = x[k]; barrier(CLK_LOCAL_MEM_FENCE); // Internal iterations, local input and output for ( ;linc>1;linc>>=2) { low = t & (linc - 1); // low order bits (below INC) i = ((t - low) << 2) + low; // insert 00 at position INC reverse = ((dir & i) == 0); // asc/desc order for (int k=0;k<4;k++) x[k] = aux[(i+k*linc) & wgBits]; % if argsort: for (int k=0;k<4;k++) y[k] = auy[(i+k*linc) & wgBits]; B4V(x,y,0); barrier(CLK_LOCAL_MEM_FENCE); for (int k=0;k<4;k++) auy[(i+k*linc) & wgBits] = y[k]; % else: B4V(x,0); barrier(CLK_LOCAL_MEM_FENCE); % endif for (int k=0;k<4;k++) aux[(i+k*linc) & wgBits] = x[k]; barrier(CLK_LOCAL_MEM_FENCE); } // Final iteration, local input, global output, INC=1 i = t << 2; reverse = ((dir & i) == 0); // asc/desc order for (int k=0;k<4;k++) x[k] = aux[(i+k) & wgBits]; % if argsort: for (int k=0;k<4;k++) y[k] = auy[(i+k) & wgBits]; B4V(x,y,0); for (int k=0;k<4;k++) index[i+k] = y[k]; % else: B4V(x,0); % endif for (int k=0;k<4;k++) data[i+k] = x[k]; } """ # noqa: E501 # }}} # {{{ local merge ParallelMerge_Local = """//CL// // N threads, WG is workgroup size. Sort WG input blocks in each workgroup. 
__kernel void run(__global const data_t * in,__global data_t * out,__local data_t * aux) { int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size, power of 2 // Move IN, OUT to block start int offset = get_group_id(0) * wg; in += offset; out += offset; // Load block in AUX[WG] aux[i] = in[i]; barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date // Now we will merge sub-sequences of length 1,2,...,WG/2 for (int length=1;length0;pinc>>=1) // increment for dichotomic search { int j = sibling+pos+pinc-1; data_t jKey = getKey(aux[j]); bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); pos += (smaller)?pinc:0; pos = min(pos,length); } int bits = 2*length-1; // mask for destination int dest = ((ii + pos) & bits) | (i & ~bits); // destination index in merged sequence barrier(CLK_LOCAL_MEM_FENCE); aux[dest] = iData; barrier(CLK_LOCAL_MEM_FENCE); } // Write output out[i] = aux[i]; } """ # noqa: E501 # }}} # {{{ ParallelBitonic_Local = """//CL// // N threads, WG is workgroup size. Sort WG input blocks in each workgroup. __kernel void run(__global const data_t * in,__global data_t * out,__local data_t * aux) { int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size, power of 2 // Move IN, OUT to block start int offset = get_group_id(0) * wg; in += offset; out += offset; // Load block in AUX[WG] aux[i] = in[i]; barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date // Loop on sorted sequence length for (int length=1;length0;pinc>>=1) { int j = i + pinc; // sibling to compare data_t iData = aux[i]; uint iKey = getKey(iData); data_t jData = aux[j]; uint jKey = getKey(jData); bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); bool swap = smaller ^ (j < i) ^ direction; barrier(CLK_LOCAL_MEM_FENCE); aux[i] = (swap)?jData:iData; barrier(CLK_LOCAL_MEM_FENCE); } } // Write output out[i] = aux[i]; } """ # }}} # {{{ A ParallelBitonic_A = """//CL// __kernel void ParallelBitonic_A(__global const data_t * in) { int i = get_global_id(0); // thread index int j = i ^ inc; // sibling to compare // Load values at I and J data_t iData = in[i]; uint iKey = getKey(iData); data_t jData = in[j]; uint jKey = getKey(jData); // Compare bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); bool swap = smaller ^ (j < i) ^ ((dir & i) != 0); // Store in[i] = (swap)?jData:iData; } """ # }}} # {{{ local optim ParallelBitonic_Local_Optim = """//CL// __kernel void run\\ % if argsort: (__global data_t * data, __global idx_t * index, __local data_t * aux, __local idx_t * auy) % else: (__global data_t * data, __local data_t * aux) % endif { int t = get_global_id(0) % dsize; // thread index int gt = get_global_id(0) / dsize; int offset = (gt/nsize)*nsize*dsize+(gt%nsize); int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size, power of 2 // Move IN, OUT to block start //int offset = get_group_id(0) * wg; data += offset; // Load block in AUX[WG] data_t iData = data[t*nsize]; aux[i] = iData; % if argsort: index += offset; // Load block in AUY[WG] idx_t iidx = index[t*nsize]; auy[i] = iidx; % endif barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date // Loop on sorted sequence length for (int pwg=1;pwg<=wg;pwg<<=1){ int loffset = pwg*(i/pwg); int ii = i%pwg; for (int length=1;length0;pinc>>=1){ int j = ii ^ pinc; // sibling to compare data_t jData = aux[loffset+j]; % if argsort: idx_t jidx = auy[loffset+j]; % endif data_t 
iKey = getKey(iData); data_t jKey = getKey(jData); bool smaller = (jKey < iKey) || ( jKey == iKey && j < ii ); bool swap = smaller ^ (ii>j) ^ direction; iData = (swap)?jData:iData; // update iData % if argsort: iidx = (swap)?jidx:iidx; // update iidx % endif barrier(CLK_LOCAL_MEM_FENCE); aux[loffset+ii] = iData; % if argsort: auy[loffset+ii] = iidx; % endif barrier(CLK_LOCAL_MEM_FENCE); } } } // Write output data[t*nsize] = iData; % if argsort: index[t*nsize] = iidx; % endif } """ # noqa: E501 # }}} # vim: filetype=pyopencl:fdm=marker pyopencl-2025.1/pyopencl/cache.py0000644000000000000000000003720014332717401013634 0ustar00"""PyOpenCL compiler cache.""" __copyright__ = "Copyright (C) 2011 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import logging import os import re import sys from dataclasses import dataclass from typing import List, Optional, Tuple import pyopencl._cl as _cl logger = logging.getLogger(__name__) import hashlib new_hash = hashlib.md5 def _erase_dir(directory): from os import listdir, rmdir, unlink from os.path import join for name in listdir(directory): unlink(join(directory, name)) rmdir(directory) def update_checksum(checksum, obj): if isinstance(obj, str): checksum.update(obj.encode("utf8")) else: checksum.update(obj) # {{{ cleanup class CleanupBase: pass class CleanupManager(CleanupBase): def __init__(self): self.cleanups = [] def register(self, c): self.cleanups.insert(0, c) def clean_up(self): for c in self.cleanups: c.clean_up() def error_clean_up(self): for c in self.cleanups: c.error_clean_up() class CacheLockManager(CleanupBase): def __init__(self, cleanup_m, cache_dir): if cache_dir is not None: self.lock_file = os.path.join(cache_dir, "lock") attempts = 0 while True: try: self.fd = os.open(self.lock_file, os.O_CREAT | os.O_WRONLY | os.O_EXCL) break except OSError: pass # This value was chosen based on the py-filelock package: # https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113 # When running pyopencl in an application with multiple ranks # that share a cache_dir, higher timeouts can lead to # application stalls even with low numbers of ranks. # cf. 
https://github.com/inducer/pyopencl/pull/504 wait_time_seconds = 0.05 # Warn every 10 seconds if not able to acquire lock warn_attempts = int(10/wait_time_seconds) # Exit after 60 seconds if not able to acquire lock exit_attempts = int(60/wait_time_seconds) from time import sleep sleep(wait_time_seconds) attempts += 1 if attempts % warn_attempts == 0: from warnings import warn warn( f"Could not obtain cache lock--delete '{self.lock_file}' " "if necessary", stacklevel=2) if attempts > exit_attempts: raise RuntimeError("waited more than one minute " "on the lock file '%s'" "--something is wrong" % self.lock_file) cleanup_m.register(self) def clean_up(self): os.close(self.fd) os.unlink(self.lock_file) def error_clean_up(self): pass class ModuleCacheDirManager(CleanupBase): def __init__(self, cleanup_m, path): from os import mkdir self.path = path try: mkdir(self.path) cleanup_m.register(self) self.existed = False except OSError: self.existed = True def sub(self, n): from os.path import join return join(self.path, n) def reset(self): _erase_dir(self.path) os.mkdir(self.path) def clean_up(self): pass def error_clean_up(self): _erase_dir(self.path) # }}} # {{{ #include dependency handling C_INCLUDE_RE = re.compile(r'^\s*\#\s*include\s+[<"](.+)[">]\s*$', re.MULTILINE) def get_dependencies(src, include_path): result = {} from os.path import join, realpath def _inner(src): for match in C_INCLUDE_RE.finditer(src): included = match.group(1) found = False for ipath in include_path: included_file_name = realpath(join(ipath, included)) if included_file_name not in result: try: src_file = open(included_file_name) except OSError: continue try: included_src = src_file.read() finally: src_file.close() # prevent infinite recursion if some header file appears to # include itself result[included_file_name] = None checksum = new_hash() update_checksum(checksum, included_src) _inner(included_src) result[included_file_name] = ( os.stat(included_file_name).st_mtime, checksum.hexdigest(), ) found = True break # stop searching the include path if not found: pass _inner(src) result = [(name, *vals) for name, vals in result.items()] result.sort() return result def get_file_md5sum(fname): checksum = new_hash() inf = open(fname) try: contents = inf.read() finally: inf.close() update_checksum(checksum, contents) return checksum.hexdigest() def check_dependencies(deps): for name, date, md5sum in deps: try: possibly_updated = os.stat(name).st_mtime != date except OSError: return False else: if possibly_updated and md5sum != get_file_md5sum(name): return False return True # }}} # {{{ key generation def get_device_cache_id(device): from pyopencl.version import VERSION platform = device.platform return (VERSION, platform.vendor, platform.name, platform.version, device.vendor, device.name, device.version, device.driver_version) def get_cache_key(device, options_bytes, src): checksum = new_hash() update_checksum(checksum, src) update_checksum(checksum, options_bytes) update_checksum(checksum, str(get_device_cache_id(device))) return checksum.hexdigest() # }}} def retrieve_from_cache(cache_dir, cache_key): class _InvalidInfoFileError(RuntimeError): pass from os.path import isdir, join module_cache_dir = join(cache_dir, cache_key) if not isdir(module_cache_dir): return None cleanup_m = CleanupManager() try: try: CacheLockManager(cleanup_m, cache_dir) mod_cache_dir_m = ModuleCacheDirManager(cleanup_m, module_cache_dir) info_path = mod_cache_dir_m.sub("info") binary_path = mod_cache_dir_m.sub("binary") # {{{ load info file try: 
from pickle import load try: info_file = open(info_path, "rb") except OSError as err: raise _InvalidInfoFileError() from err try: try: info = load(info_file) except EOFError as err: raise _InvalidInfoFileError() from err finally: info_file.close() except _InvalidInfoFileError: mod_cache_dir_m.reset() from warnings import warn warn( "PyOpenCL encountered an invalid info file for " f"cache key '{cache_key}'", stacklevel=2) return None # }}} # {{{ load binary binary_file = open(binary_path, "rb") try: binary = binary_file.read() finally: binary_file.close() # }}} if check_dependencies(info.dependencies): return binary, info.log else: mod_cache_dir_m.reset() except Exception: cleanup_m.error_clean_up() raise finally: cleanup_m.clean_up() # {{{ top-level driver @dataclass(frozen=True) class _SourceInfo: dependencies: List[Tuple[str, ...]] log: Optional[str] def _create_built_program_from_source_cached(ctx, src, options_bytes, devices, cache_dir, include_path): from os.path import join if cache_dir is None: import platformdirs # Determine the cache directory in the same way as pytools.PersistentDict, # which PyOpenCL uses for invoker caches. if sys.platform == "darwin" and os.getenv("XDG_CACHE_HOME") is not None: # platformdirs does not handle XDG_CACHE_HOME on macOS # https://github.com/platformdirs/platformdirs/issues/269 cache_dir = join(os.getenv("XDG_CACHE_HOME"), "pyopencl") else: cache_dir = platformdirs.user_cache_dir("pyopencl", "pyopencl") cache_dir = join(cache_dir, "pyopencl-compiler-cache-v2-py{}".format( ".".join(str(i) for i in sys.version_info))) os.makedirs(cache_dir, exist_ok=True) if devices is None: devices = ctx.devices cache_keys = [get_cache_key(device, options_bytes, src) for device in devices] binaries = [] to_be_built_indices = [] logs = [] for i, (_device, cache_key) in enumerate(zip(devices, cache_keys)): cache_result = retrieve_from_cache(cache_dir, cache_key) if cache_result is None: logger.debug("build program: binary cache miss (key: %s)", cache_key) to_be_built_indices.append(i) binaries.append(None) logs.append(None) else: logger.debug("build program: binary cache hit (key: %s)", cache_key) binary, log = cache_result binaries.append(binary) logs.append(log) message = (75*"="+"\n").join( f"Build on {dev} succeeded, but said:\n\n{log}" for dev, log in zip(devices, logs) if log is not None and log.strip()) if message: from pyopencl import compiler_output compiler_output( "Built kernel retrieved from cache. 
Original from-source " "build had warnings:\n"+message) # {{{ build on the build-needing devices, in one go result = None already_built = False was_cached = not to_be_built_indices if to_be_built_indices: # defeat implementation caches: from uuid import uuid4 src = src + "\n\n__constant int pyopencl_defeat_cache_%s = 0;" % ( uuid4().hex) logger.debug( "build program: start building program from source on %s", ", ".join(str(devices[i]) for i in to_be_built_indices)) prg = _cl._Program(ctx, src) prg.build(options_bytes, [devices[i] for i in to_be_built_indices]) logger.debug("build program: from-source build complete") prg_devs = prg.get_info(_cl.program_info.DEVICES) prg_bins = prg.get_info(_cl.program_info.BINARIES) prg_logs = prg._get_build_logs() for dest_index in to_be_built_indices: dev = devices[dest_index] src_index = prg_devs.index(dev) binaries[dest_index] = prg_bins[src_index] _, logs[dest_index] = prg_logs[src_index] if len(to_be_built_indices) == len(devices): # Important special case: if code for all devices was built, # then we may simply use the program that we just built as the # final result. result = prg already_built = True if result is None: result = _cl._Program(ctx, devices, binaries) # }}} # {{{ save binaries to cache if to_be_built_indices: cleanup_m = CleanupManager() try: try: CacheLockManager(cleanup_m, cache_dir) for i in to_be_built_indices: cache_key = cache_keys[i] binary = binaries[i] mod_cache_dir_m = ModuleCacheDirManager(cleanup_m, join(cache_dir, cache_key)) info_path = mod_cache_dir_m.sub("info") binary_path = mod_cache_dir_m.sub("binary") source_path = mod_cache_dir_m.sub("source.cl") with open(source_path, "w") as outf: outf.write(src) with open(binary_path, "wb") as outf: outf.write(binary) from pickle import dump info_file = open(info_path, "wb") dump(_SourceInfo( dependencies=get_dependencies(src, include_path), log=logs[i]), info_file) info_file.close() except Exception: cleanup_m.error_clean_up() raise finally: cleanup_m.clean_up() # }}} return result, already_built, was_cached def create_built_program_from_source_cached(ctx, src, options_bytes, devices=None, cache_dir=None, include_path=None): try: was_cached = False already_built = False if cache_dir is not False: prg, already_built, was_cached = \ _create_built_program_from_source_cached( ctx, src, options_bytes, devices, cache_dir, include_path=include_path) if was_cached and not already_built: prg.build(options_bytes, devices) already_built = True else: prg = _cl._Program(ctx, src) except Exception as e: from pyopencl import Error build_program_failure = (isinstance(e, Error) and e.code == _cl.status_code.BUILD_PROGRAM_FAILURE) # pylint:disable=no-member # Mac error on intel CPU driver: can't build from cached version. # If we get a build_program_failure from the cached version then # build from source instead, otherwise report the failure. 
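# Illustrative sketch of how the cache above is keyed (hypothetical
# source/options values; not part of the original control flow):
#
#     import pyopencl as cl
#     from pyopencl.cache import get_cache_key
#
#     ctx = cl.create_some_context()
#     dev = ctx.devices[0]
#     src = "__kernel void f(__global float *a) { a[0] = 1; }"
#
#     # The key hashes the source, the build options, and a
#     # platform/device/version fingerprint, so driver or PyOpenCL
#     # upgrades invalidate previously cached binaries.
#     key = get_cache_key(dev, b"-cl-fast-relaxed-math", src)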
if build_program_failure and not was_cached: raise if not build_program_failure: from traceback import format_exc from warnings import warn warn( "PyOpenCL compiler caching failed with an exception:\n" f"[begin exception]\n{format_exc()}[end exception]", stacklevel=2) prg = _cl._Program(ctx, src) was_cached = False already_built = False if not already_built: prg.build(options_bytes, devices) return prg, was_cached # }}} # vim: foldmethod=marker pyopencl-2025.1/pyopencl/capture_call.py0000644000000000000000000001307614332717401015234 0ustar00__copyright__ = "Copyright (C) 2013 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import numpy as np from pytools.py_codegen import Indentation, PythonCodeGenerator import pyopencl as cl def capture_kernel_call(kernel, output_file, queue, g_size, l_size, *args, **kwargs): try: source = kernel._source except AttributeError as err: raise RuntimeError("cannot capture call, kernel source not available") from err if source is None: raise RuntimeError("cannot capture call, kernel source not available") cg = PythonCodeGenerator() cg("# generated by pyopencl.capture_call") cg("") cg("import numpy as np") cg("import pyopencl as cl") cg("from base64 import b64decode") cg("from zlib import decompress") cg("mf = cl.mem_flags") cg("") cg('CODE = r"""//CL//') for line in source.split("\n"): cg(line) cg('"""') # {{{ invocation arg_data = [] cg("") cg("") cg("def main():") with Indentation(cg): cg("ctx = cl.create_some_context()") cg("queue = cl.CommandQueue(ctx)") cg("") kernel_args = [] for i, arg in enumerate(args): if isinstance(arg, cl.Buffer): buf = bytearray(arg.size) cl.enqueue_copy(queue, buf, arg) arg_data.append(("arg%d_data" % i, buf)) cg("arg%d = cl.Buffer(ctx, " "mf.READ_WRITE | cl.mem_flags.COPY_HOST_PTR," % i) cg(" hostbuf=decompress(b64decode(arg%d_data)))" % i) kernel_args.append("arg%d" % i) elif isinstance(arg, (int, float)): kernel_args.append(repr(arg)) elif isinstance(arg, np.integer): kernel_args.append("np.{}({})".format( arg.dtype.type.__name__, repr(int(arg)))) elif isinstance(arg, np.floating): kernel_args.append("np.{}({})".format( arg.dtype.type.__name__, repr(float(arg)))) elif isinstance(arg, np.complexfloating): kernel_args.append("np.{}({})".format( arg.dtype.type.__name__, repr(complex(arg)))) else: try: arg_buf = memoryview(arg) except Exception as err: raise RuntimeError("cannot capture: " "unsupported arg nr %d (0-based)" % i) from err arg_data.append(("arg%d_data" % i, arg_buf)) kernel_args.append("decompress(b64decode(arg%d_data))" % 
i) cg("") g_times_l = kwargs.get("g_times_l", False) if g_times_l: dim = max(len(g_size), len(l_size)) l_size = l_size + (1,) * (dim-len(l_size)) g_size = g_size + (1,) * (dim-len(g_size)) g_size = tuple( gs*ls for gs, ls in zip(g_size, l_size)) global_offset = kwargs.get("global_offset", None) if global_offset is not None: kernel_args.append("global_offset=%s" % repr(global_offset)) cg("prg = cl.Program(ctx, CODE).build()") cg("knl = prg.%s" % kernel.function_name) if hasattr(kernel, "_scalar_arg_dtypes"): def strify_dtype(d): if d is None: return "None" d = np.dtype(d) s = repr(d) if s.startswith("dtype"): s = "np."+s return s cg("knl.set_scalar_arg_dtypes((%s,))" % ", ".join( strify_dtype(dt) for dt in kernel._scalar_arg_dtypes)) cg("knl(queue, {}, {},".format(repr(g_size), repr(l_size))) cg(" %s)" % ", ".join(kernel_args)) cg("") cg("queue.finish()") # }}} # {{{ data from base64 import b64encode from zlib import compress cg("") line_len = 70 for name, val in arg_data: cg("%s = (" % name) with Indentation(cg): val = b64encode(compress(memoryview(val))).decode() i = 0 while i < len(val): cg(repr(val[i:i+line_len])) i += line_len cg(")") # }}} # {{{ file trailer cg("") cg('if __name__ == "__main__":') with Indentation(cg): cg("main()") cg("") cg("# vim: filetype=pyopencl") # }}} if isinstance(output_file, str): with open(output_file, "w") as outf: outf.write(cg.get()) else: output_file.write(cg.get()) pyopencl-2025.1/pyopencl/characterize/__init__.py0000644000000000000000000003401214332717401016772 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
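Illustrative sketch of the preceding ``pyopencl.capture_call`` module
(hypothetical kernel, buffer, and output file name; not from the original
source)::

    import pyopencl as cl
    from pyopencl.capture_call import capture_kernel_call

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, "__kernel void twice(__global float *a)"
                          "{ a[get_global_id(0)] *= 2; }").build()
    a_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=4*16)

    # Writes a self-contained replay script, embedding buffer contents.
    capture_kernel_call(prg.twice, "replay_twice.py", queue, (16,), None,
                        a_buf)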
""" from typing import Dict, Optional, Tuple from pytools import memoize import pyopencl as cl class CLCharacterizationWarning(UserWarning): pass @memoize def has_double_support(dev): for ext in dev.extensions.split(" "): if ext == "cl_khr_fp64": return True return False def has_amd_double_support(dev): """"Fix to allow incomplete amd double support in low end boards""" for ext in dev.extensions.split(" "): if ext == "cl_amd_fp64": return True return False def reasonable_work_group_size_multiple(dev, ctx=None): try: return dev.warp_size_nv except Exception: pass if ctx is None: ctx = cl.Context([dev]) prg = cl.Program(ctx, """ __kernel void knl(__global float *a) { a[get_global_id(0)] = 0; } """) prg.build() return prg.knl.get_work_group_info( cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, dev) def nv_compute_capability(dev): """If *dev* is an Nvidia GPU :class:`pyopencl.Device`, return a tuple *(major, minor)* indicating the device's compute capability. """ try: return (dev.compute_capability_major_nv, dev.compute_capability_minor_nv) except Exception: return None def usable_local_mem_size(dev, nargs=None): """Return an estimate of the usable local memory size. :arg nargs: Number of 32-bit arguments passed. """ usable_local_mem_size = dev.local_mem_size nv_compute_cap = nv_compute_capability(dev) if (nv_compute_cap is not None and nv_compute_cap < (2, 0)): # pre-Fermi use local mem for parameter passing if nargs is None: # assume maximum usable_local_mem_size -= 256 else: usable_local_mem_size -= 4*nargs return usable_local_mem_size def simultaneous_work_items_on_local_access(dev): """Return the number of work items that access local memory simultaneously and thereby may conflict with each other. """ nv_compute_cap = nv_compute_capability(dev) if nv_compute_cap is not None: if nv_compute_cap < (2, 0): return 16 else: if nv_compute_cap >= (3, 0): from warnings import warn warn( f"Wildly guessing conflicting local access size on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 32 if dev.type & cl.device_type.GPU: from warnings import warn warn( f"Wildly guessing conflicting local access size on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 16 elif dev.type & cl.device_type.CPU: return 1 else: from warnings import warn warn( f"Wildly guessing conflicting local access size on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 16 def local_memory_access_granularity(dev): """Return the number of bytes per bank in local memory.""" return 4 def local_memory_bank_count(dev): """Return the number of banks present in local memory. 
""" nv_compute_cap = nv_compute_capability(dev) if nv_compute_cap is not None: if nv_compute_cap < (2, 0): return 16 else: if nv_compute_cap >= (3, 0): from warnings import warn warn( f"Wildly guessing local memory bank count on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 32 if dev.type & cl.device_type.GPU: from warnings import warn warn( f"Wildly guessing local memory bank count on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 16 elif dev.type & cl.device_type.CPU: if dev.local_mem_type == cl.device_local_mem_type.GLOBAL: raise RuntimeError("asking for a bank count is " "meaningless for cache-based lmem") from warnings import warn warn( f"Wildly guessing conflicting local access size on '{dev}'", CLCharacterizationWarning, stacklevel=2) return 16 def why_not_local_access_conflict_free(dev, itemsize, array_shape, array_stored_shape=None): """ :param itemsize: size of accessed data in bytes :param array_shape: array dimensions, fastest-moving last (C order) :returns: a tuple (multiplicity, explanation), where *multiplicity* is the number of work items that will conflict on a bank when accessing local memory. *explanation* is a string detailing the found conflict. """ # FIXME: Treat 64-bit access on NV CC 2.x + correctly if array_stored_shape is None: array_stored_shape = array_shape rank = len(array_shape) array_shape = array_shape[::-1] array_stored_shape = array_stored_shape[::-1] gran = local_memory_access_granularity(dev) if itemsize != gran: from warnings import warn warn( f"Local conflict info might be inaccurate for itemsize != {gran}", CLCharacterizationWarning, stacklevel=2) sim_wi = simultaneous_work_items_on_local_access(dev) bank_count = local_memory_bank_count(dev) conflicts = [] for work_item_axis in range(rank): bank_accesses = {} for work_item_id in range(sim_wi): addr = 0 addr_mult = itemsize idx = [] left_over_idx = work_item_id for axis, (ax_size, ax_stor_size) in enumerate( zip(array_shape, array_stored_shape)): if axis >= work_item_axis: left_over_idx, ax_idx = divmod(left_over_idx, ax_size) addr += addr_mult*ax_idx idx.append(ax_idx) else: idx.append(0) addr_mult *= ax_stor_size if left_over_idx: # out-of-bounds, assume not taking place continue bank = (addr // gran) % bank_count bank_accesses.setdefault(bank, []).append( "w.item {} -> {}".format(work_item_id, idx[::-1])) conflict_multiplicity = max( len(acc) for acc in bank_accesses.values()) if conflict_multiplicity > 1: for bank, acc in bank_accesses.items(): if len(acc) == conflict_multiplicity: conflicts.append( (conflict_multiplicity, "%dx conflict on axis %d (from right, 0-based): " "%s access bank %d" % ( conflict_multiplicity, work_item_axis, ", ".join(acc), bank))) if conflicts: return max(conflicts) else: return 1, None def get_fast_inaccurate_build_options(dev): """Return a list of flags valid on device *dev* that enable fast, but potentially inaccurate floating point math. """ result = ["-cl-mad-enable", "-cl-fast-relaxed-math", "-cl-no-signed-zeros", ] if dev.vendor.startswith("Advanced Micro") or dev.vendor.startswith("NVIDIA"): result.append("-cl-strict-aliasing") return result def get_simd_group_size(dev, type_size): """Return an estimate of how many work items will be executed across SIMD lanes. This returns the size of what Nvidia calls a warp and what AMD calls a wavefront. Only refers to implicit SIMD. :arg type_size: number of bytes in vector entry type. 
""" try: return dev.warp_size_nv except Exception: pass lc_plat_vendor = dev.platform.vendor.lower() lc_dev_vendor = dev.vendor.lower() if "nvidia" in lc_plat_vendor or "nvidia" in lc_dev_vendor: return 32 if ("advanced micro" in lc_plat_vendor or "ati" in lc_plat_vendor or "advanced micro" in lc_dev_vendor or "ati" in lc_dev_vendor): if dev.type & cl.device_type.GPU: # Tomasz Rybak says, in response to reduction misbehaving on the AMD # 'Loveland' APU: # # Like in CUDA reduction bug (related to Fermi) it again seems # to be related to too eager concurrency when reducing results. # According to http://oscarbg.blogspot.com/2009/10/news-from-web.html # "Actually the wavefront size is only 64 for the highend cards(48XX, # 58XX, 57XX), but 32 for the middleend cards and 16 for the lowend # cards." # IMO we should use PREFERRED_WORK_GROUP_SIZE_MULTIPLE to get # non_sync_size. At the same size we lose SIMD CPU optimisation, # but I do not know for now how to fix those two at the same time. # Attached patch fixes problem on Loveland, not breaking anything on # NVIDIA ION. # This is therefore our best guess as to the SIMD group size. return reasonable_work_group_size_multiple(dev) elif dev.type & cl.device_type.CPU: return 1 else: raise RuntimeError("unexpected AMD device type") if dev.type & cl.device_type.CPU: # implicit assumption: Impl. will vectorize return 1 return None def get_pocl_version( platform: cl.Platform, fallback_value: Optional[Tuple[int, int]] = None ) -> Optional[Tuple[int, int]]: if platform.name != "Portable Computing Language": return None import re version = platform.version ver_match = re.match( r"^OpenCL [0-9.]+ [Pp]o[Cc][Ll] ([0-9]+)\.([0-9]+)", version) if ver_match is None: from warnings import warn warn(f"PoCL version number did not have expected format: '{version}'", stacklevel=2) return fallback_value else: return (int(ver_match.group(1)), int(ver_match.group(2))) _CHECK_FOR_POCL_ARG_COUNT_BUG_CACHE: Dict[cl.Device, bool] = {} def _check_for_pocl_arg_count_bug( dev: cl.Device, ctx: Optional[cl.Context] = None) -> bool: try: return _CHECK_FOR_POCL_ARG_COUNT_BUG_CACHE[dev] except KeyError: pass if ctx is None: build_ctx = cl.Context([dev]) else: build_ctx = ctx prg = cl.Program(build_ctx, """ struct two_things { long a; long b; }; __kernel void test_knl(struct two_things x) { } """).build() result = prg.test_knl.num_args == 2 _CHECK_FOR_POCL_ARG_COUNT_BUG_CACHE[dev] = result return result def has_struct_arg_count_bug(dev, ctx=None): """Checks whether the device is expected to have the `argument counting bug `__. 
""" if dev.platform.name == "Apple" and dev.type & cl.device_type.CPU: return "apple" if dev.platform.name == "Portable Computing Language": pocl_version = get_pocl_version(dev.platform, fallback_value=(0, 14)) if pocl_version <= (0, 13): return "pocl" elif pocl_version <= (0, 14) and _check_for_pocl_arg_count_bug(dev, ctx): return "pocl" return False # {{{ SVM capabilities def _may_have_svm(dev): has_svm = (dev.platform._get_cl_version() >= (2, 0) and cl.get_cl_header_version() >= (2, 0)) if dev.platform.name == "Portable Computing Language": has_svm = ( get_pocl_version(dev.platform) >= (1, 0) and cl.get_cl_header_version() >= (2, 0)) return has_svm def has_coarse_grain_buffer_svm(dev): return (_may_have_svm(dev) and bool(dev.svm_capabilities & cl.device_svm_capabilities.COARSE_GRAIN_BUFFER)) def has_fine_grain_buffer_svm(dev): return (_may_have_svm(dev) and bool(dev.svm_capabilities & cl.device_svm_capabilities.FINE_GRAIN_BUFFER)) def has_fine_grain_system_svm(dev): return (_may_have_svm(dev) and bool(dev.svm_capabilities & cl.device_svm_capabilities.FINE_GRAIN_SYSTEM)) def has_fine_grain_buffer_svm_atomics(dev): return has_fine_grain_buffer_svm(dev) and bool(dev.svm_capabilities & cl.device_svm_capabilities.ATOMICS) def has_fine_grain_system_svm_atomics(dev): return has_fine_grain_system_svm(dev) and bool(dev.svm_capabilities & cl.device_svm_capabilities.ATOMICS) # }}} def has_src_build_cache(dev: cl.Device) -> Optional[bool]: """ Return *True* if *dev* has internal support for caching builds from source, *False* if it doesn't, and *None* if unknown. """ if dev.platform.name == "Portable Computing Language": return True if nv_compute_capability(dev) is not None: return True if dev.platform.name == "AMD Accelerated Parallel Processing": return False return None # vim: foldmethod=marker pyopencl-2025.1/pyopencl/characterize/performance.py0000644000000000000000000001532214332717401017537 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import numpy as np import pyopencl as cl # {{{ timing helpers class Timer: def __init__(self, queue): self.queue = queue def start(self): pass def stop(self): pass def add_event(self, evt): pass def get_elapsed(self): pass class WallTimer(Timer): def start(self): from time import time self.queue.finish() self.start_time = time() def stop(self): from time import time self.queue.finish() self.end_time = time() def get_elapsed(self): return self.end_time-self.start_time def _get_time(queue, f, timer_factory=None, desired_duration=0.1, warmup_rounds=3): if timer_factory is None: timer_factory = WallTimer count = 1 while True: timer = timer_factory(queue) for _i in range(warmup_rounds): f() warmup_rounds = 0 timer.start() for _i in range(count): timer.add_event(f()) timer.stop() elapsed = timer.get_elapsed() if elapsed < desired_duration: if elapsed == 0: count *= 5 else: new_count = int(desired_duration/elapsed) new_count = max(2*count, new_count) new_count = min(10*count, new_count) count = new_count else: return elapsed/count # }}} # {{{ transfer measurements class HostDeviceTransferBase: def __init__(self, queue, block_size): self.queue = queue self.host_buf = np.empty(block_size, dtype=np.uint8) self.dev_buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE, block_size) class HostToDeviceTransfer(HostDeviceTransferBase): def do(self): return cl.enqueue_copy(self. queue, self.dev_buf, self.host_buf) class DeviceToHostTransfer(HostDeviceTransferBase): def do(self): return cl.enqueue_copy(self. queue, self.host_buf, self.dev_buf) class DeviceToDeviceTransfer: def __init__(self, queue, block_size): self.queue = queue mf = cl.mem_flags self.dev_buf_1 = cl.Buffer(queue.context, mf.READ_WRITE, block_size) self.dev_buf_2 = cl.Buffer(queue.context, mf.READ_WRITE, block_size) def do(self): return cl.enqueue_copy(self. 
queue, self.dev_buf_2, self.dev_buf_1) def transfer_latency(queue, transfer_type, timer_factory=None): transfer = transfer_type(queue, 1) return _get_time(queue, transfer.do, timer_factory=timer_factory) def transfer_bandwidth(queue, transfer_type, block_size, timer_factory=None): """Measures one-sided bandwidth.""" transfer = transfer_type(queue, block_size) return block_size/_get_time(queue, transfer.do, timer_factory=timer_factory) # }}} def get_profiling_overhead(ctx, timer_factory=None): no_prof_queue = cl.CommandQueue(ctx) transfer = DeviceToDeviceTransfer(no_prof_queue, 1) no_prof_time = _get_time(no_prof_queue, transfer.do, timer_factory=timer_factory) prof_queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE) transfer = DeviceToDeviceTransfer(prof_queue, 1) prof_time = _get_time(prof_queue, transfer.do, timer_factory=timer_factory) return prof_time - no_prof_time, prof_time def get_empty_kernel_time(queue, timer_factory=None): prg = cl.Program(queue.context, """ __kernel void empty() { } """).build() knl = prg.empty def f(): knl(queue, (1,), None) return _get_time(queue, f, timer_factory=timer_factory) def _get_full_machine_kernel_rate(queue, src, args, name="benchmark", timer_factory=None): prg = cl.Program(queue.context, src).build() knl = getattr(prg, name) dev = queue.device global_size = 4 * dev.max_compute_units def f(): knl(queue, (global_size,), None, *args) rates = [] num_dips = 0 while True: elapsed = _get_time(queue, f, timer_factory=timer_factory) rate = global_size/elapsed keep_trying = not rates if rates and rate > 1.05*max(rates): # big improvement keep_trying = True num_dips = 0 if rates and rate < 0.9*max(rates) and num_dips < 3: # big dip keep_trying = True num_dips += 1 if keep_trying: global_size *= 2 rates.append(rate) else: rates.append(rate) return max(rates) def get_add_rate(queue, type="float", timer_factory=None): return 50*10*_get_full_machine_kernel_rate(queue, """ typedef %(op_t)s op_t; __kernel void benchmark() { local op_t tgt[1024]; op_t val = get_global_id(0); for (int i = 0; i < 10; ++i) { val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; val += val; } tgt[get_local_id(0)] = val; } """ % {"op_t": type}, ()) # vim: foldmethod=marker:filetype=pyopencl pyopencl-2025.1/pyopencl/cl/pyopencl-airy.cl0000644000000000000000000001767214332717401015743 0ustar00// Ported from Cephes by // Andreas Kloeckner (C) 2012 // // Cephes Math Library Release 2.8: June, 2000 // Copyright 1984, 1987, 1989, 1992, 2000 by Stephen L. Moshier // What you see here may be used freely, but it comes with no support or // guarantee. 
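//
// (Added note, not part of the original Cephes port.) This file defines
//
//   int airy(double x, double *ai, double *aip, double *bi, double *bip)
//
// which evaluates Ai(x), Ai'(x), Bi(x), and Bi'(x) in one call. It returns 0
// on success and -1 for x > airy_maxairy, in which case *ai and *aip are set
// to 0 and *bi and *bip to DBL_MAX. The polynomial helpers cephes_polevl and
// cephes_p1evl used below are defined in pyopencl-eval-tbl.cl. A minimal
// device-side usage sketch (hypothetical kernel and buffer names; assumes
// cl_khr_fp64 double precision is available):
//
//   __kernel void eval_airy(__global const double *x, __global double4 *out)
//   {
//       int i = get_global_id(0);
//       double ai, aip, bi, bip;
//       airy(x[i], &ai, &aip, &bi, &bip);
//       out[i] = (double4)(ai, aip, bi, bip);
//   }
//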
#pragma once #include __constant const double airy_maxairy = 103.892; __constant const double airy_sqrt3 = 1.732050807568877293527; __constant const double airy_sqpii = 5.64189583547756286948E-1; __constant const double airy_c1 = 0.35502805388781723926; __constant const double airy_c2 = 0.258819403792806798405; __constant const unsigned short AN[32] = { 0x3fd6,0x2dae,0x2537,0xb658, 0x4028,0x03e3,0x871a,0x9067, 0x4053,0x11e5,0x0de2,0xe1e3, 0x4065,0x02da,0xee40,0x073c, 0x4063,0xf834,0x5ba1,0xfddf, 0x4051,0xa24f,0x4f4c,0xea4f, 0x402c,0x0d8d,0x5c2a,0x0f4d, 0x3ff0,0x0000,0x0000,0x0000, }; __constant const unsigned short AD[32] = { 0x3fe2,0x29bc,0x0262,0x4d31, 0x402d,0x8334,0x0533,0x2ca5, 0x4055,0x20e3,0xb04d,0x51a0, 0x4066,0x2a2d,0xc730,0xb7b0, 0x4064,0x8782,0x9a9f,0xfa61, 0x4051,0xde94,0xee91,0xd35f, 0x402c,0x311b,0x950d,0x9d81, 0x3ff0,0x0000,0x0000,0x0000, }; __constant const unsigned short APN[32] = { 0x3fe3,0xa3ea,0x4d4c,0xab3e, 0x402d,0x7dad,0xdc67,0x2bcf, 0x4054,0x83bd,0x0724,0xa9a6, 0x4065,0x65e9,0xba99,0xc9ba, 0x4063,0xea2b,0xcdc2,0x64d7, 0x4051,0x7e95,0x41d4,0x1646, 0x402b,0xe4e8,0x6aa7,0x4099, 0x3ff0,0x0000,0x0000,0x0000, }; __constant const unsigned short APD[32] = { 0x3fd5,0x6397,0xd288,0xd5b3, 0x4026,0x5caf,0xedc9,0x327e, 0x4051,0xcb0e,0x1800,0x97e6, 0x4063,0xd8e6,0x1132,0xdbd1, 0x4063,0x269b,0x0dcb,0x3316, 0x4051,0x2b36,0xf9d0,0xf72f, 0x402b,0xb321,0x4e35,0x7982, 0x3ff0,0x0000,0x0000,0x0000, }; __constant const unsigned short BN16[20] = { 0xbfd0,0x3518,0xe211,0x6751, 0x3fe2,0x68bc,0x7072,0x2383, 0xbfd5,0x1d32,0x6785,0xcf29, 0x3fb0,0x7f2a,0xa027,0x78a8, 0xbf6f,0x5604,0x2dba,0xcd1b, }; __constant const unsigned short BD16[20] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0xc01c,0xa09d,0x891b,0xab58, 0x4025,0x3539,0xfe0b,0x1101, 0xc014,0xee0b,0xa9a7,0x70e8, 0x3fee,0xa2fc,0xa6da,0x95ff, 0xbfac,0x33d0,0x8f8e,0x86c9, }; __constant const unsigned short BPPN[20] = { 0x3fdd,0xca1d,0x9deb,0x377b, 0xbff1,0x7051,0xc6be,0xe420, 0x3fe4,0x710c,0xf199,0x5ff3, 0xbfc0,0x3c6f,0x8681,0xa8fa, 0x3f7f,0x3b43,0xb8ce,0xb896, }; __constant const unsigned short BPPD[20] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0xc021,0x6996,0xb340,0xbc45, 0x402b,0xcc73,0x2ea4,0xbb8b, 0xc01c,0x908c,0xa04a,0xed59, 0x3ff5,0x70fd,0xf9a5,0x70a9, 0xbfb4,0x13d0,0x1b60,0x52e8, }; __constant const unsigned short AFN[36] = { 0xbfc0,0xdb6c,0xd50a,0xe6fb, 0xbfe4,0x0bee,0x9856,0x6852, 0xbfe6,0x2e59,0xc2f7,0x9f7d, 0xbfd1,0xe7ea,0x4bb3,0xf40b, 0xbfa9,0x2f6e,0xf47d,0xbd8a, 0xbf70,0xa401,0xc8d9,0xe090, 0xbf24,0xe06e,0xaf4b,0x009c, 0xbec7,0x4a78,0x1d42,0x366d, 0xbe52,0x041c,0xf68e,0xa2d2, }; __constant const unsigned short AFD[36] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0x402a,0xb64b,0x2572,0xedf2, 0x4040,0x575c,0x4478,0x7b1a, 0x403a,0xbc98,0xa3b7,0x3410, 0x4022,0x5fc8,0x2ac9,0x9873, 0x3ff7,0x9acb,0x39de,0x9319, 0x3fbd,0x9dac,0xb404,0x5a2b, 0x3f72,0x08ca,0xe03a,0xf617, 0x3f13,0xc8d7,0xaf76,0xe73b, 0x3e9e,0x52b9,0xb995,0x18a7, }; __constant const unsigned short AGN[44] = { 0x3f94,0x3525,0xddcf,0xbbde, 0x3fd9,0x07d5,0x0064,0x37b7, 0x3ff1,0x0d83,0x3a20,0x34eb, 0x3fee,0x0dac,0xa0ef,0x1acb, 0x3fd6,0x7e69,0xcea8,0xfe1d, 0x3fb0,0x3a41,0x21e9,0x0978, 0x3f77,0xfe99,0xf12f,0x5043, 0x3f32,0x8976,0x600e,0x17a2, 0x3edd,0x4f3d,0x69f8,0x574e, 0x3e75,0xca92,0xbbad,0x11c8, 0x3df7,0x78a4,0x7d97,0xee7a, }; __constant const unsigned short AGD[40] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0x4022,0x9e2b,0xf3d5,0x6b40, 0x4033,0xd5d5,0xc0ef,0x18d4, 0x402f,0x211b,0x7ea7,0xdc35, 0x4015,0xe84e,0x2b79,0xdbce, 0x3fee,0x8992,0xc195,0xece3, 0x3fb6,0x221d,0xed64,0xa9ee, 
0x3f70,0xe704,0x6be3,0x93bb, 0x3f1a,0x8b61,0xd603,0xa5a0, 0x3eb3,0xa845,0xdb07,0x24e8, 0x3e35,0x1fc7,0x3dd5,0x89d4, }; __constant const unsigned short APFN[36] = { 0x3fc7,0xba0f,0x8e7d,0x5db5, 0x3fec,0x5ff2,0x3d14,0xd07e, 0x3fef,0x98b7,0x11be,0x01af, 0x3fd9,0xadef,0x1397,0x84a1, 0x3fb2,0x2f0d,0xeadc,0x33d1, 0x3f78,0x3115,0xe347,0xa140, 0x3f2e,0x8be8,0x5d03,0x8059, 0x3ed1,0x2495,0x9f80,0x12af, 0x3e5a,0xab6a,0x654d,0x7d86, }; __constant const unsigned short APFD[36] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0x402d,0x781b,0x9628,0xcc60, 0x4042,0xc56d,0x2524,0x0e31, 0x403f,0x773d,0x09cc,0xffb8, 0x4025,0xfe6b,0x5163,0x03f7, 0x3ffc,0x9f21,0xc07a,0xc9fd, 0x3fc2,0x2450,0xe40e,0xf796, 0x3f76,0x48f2,0x3a5a,0x351a, 0x3f18,0xa059,0x7cfb,0x63a1, 0x3ea2,0xfdb8,0x5a24,0x1e2e, }; __constant const unsigned short APGN[44] = { 0xbfa2,0x351f,0x5f87,0xaf5b, 0xbfe4,0x64db,0x1ff7,0x5c76, 0xbffb,0x564a,0xc221,0x7e49, 0xbff8,0x0916,0x7f6e,0x0b07, 0xbfe2,0x0910,0xd8b0,0x6edb, 0xbfba,0x234b,0x0d8c,0x9903, 0xbf83,0x6c54,0x7f6c,0x50df, 0xbf3e,0x2afa,0x2424,0x2ad0, 0xbee7,0xf87a,0xbc17,0xf631, 0xbe81,0xe81f,0x501e,0x6c10, 0xbe03,0x5f45,0x5e46,0x870d, }; __constant const unsigned short APGD[40] = { /*0x3ff0,0x0000,0x0000,0x0000,*/ 0x4023,0xb7a2,0x060a,0x9812, 0x4035,0xa3e3,0x4724,0xfc96, 0x4031,0x5025,0xdb2c,0x819a, 0x4018,0xb702,0xd5cd,0x94e2, 0x3ff1,0x6a71,0x4927,0x1eb1, 0x3fb9,0x78de,0x4ad7,0x7bc5, 0x3f73,0x991a,0x4b2b,0xc1d7, 0x3f1e,0xf98f,0x0b16,0xbe1c, 0x3eb7,0x10bf,0xfdde,0x4ef3, 0x3e38,0xe834,0x9dc8,0x647e, }; int airy( double x, double *ai, double *aip, double *bi, double *bip ) { typedef __constant const double *data_t; double z, zz, t, f, g, uf, ug, k, zeta, theta; int domflg; domflg = 0; if( x > airy_maxairy ) { *ai = 0; *aip = 0; *bi = DBL_MAX; *bip = DBL_MAX; return(-1); } if( x < -2.09 ) { domflg = 15; t = sqrt(-x); zeta = -2.0 * x * t / 3.0; t = sqrt(t); k = airy_sqpii / t; z = 1.0/zeta; zz = z * z; uf = 1.0 + zz * cephes_polevl( zz, (data_t) AFN, 8 ) / cephes_p1evl( zz, (data_t) AFD, 9 ); ug = z * cephes_polevl( zz, (data_t) AGN, 10 ) / cephes_p1evl( zz, (data_t) AGD, 10 ); theta = zeta + 0.25 * M_PI; f = sin( theta ); g = cos( theta ); *ai = k * (f * uf - g * ug); *bi = k * (g * uf + f * ug); uf = 1.0 + zz * cephes_polevl( zz, (data_t) APFN, 8 ) / cephes_p1evl( zz, (data_t) APFD, 9 ); ug = z * cephes_polevl( zz, (data_t) APGN, 10 ) / cephes_p1evl( zz, (data_t) APGD, 10 ); k = airy_sqpii * t; *aip = -k * (g * uf + f * ug); *bip = k * (f * uf - g * ug); return(0); } if( x >= 2.09 ) /* cbrt(9) */ { domflg = 5; t = sqrt(x); zeta = 2.0 * x * t / 3.0; g = exp( zeta ); t = sqrt(t); k = 2.0 * t * g; z = 1.0/zeta; f = cephes_polevl( z, (data_t) AN, 7 ) / cephes_polevl( z, (data_t) AD, 7 ); *ai = airy_sqpii * f / k; k = -0.5 * airy_sqpii * t / g; f = cephes_polevl( z, (data_t) APN, 7 ) / cephes_polevl( z, (data_t) APD, 7 ); *aip = f * k; if( x > 8.3203353 ) /* zeta > 16 */ { f = z * cephes_polevl( z, (data_t) BN16, 4 ) / cephes_p1evl( z, (data_t) BD16, 5 ); k = airy_sqpii * g; *bi = k * (1.0 + f) / t; f = z * cephes_polevl( z, (data_t) BPPN, 4 ) / cephes_p1evl( z, (data_t) BPPD, 5 ); *bip = k * t * (1.0 + f); return(0); } } f = 1.0; g = x; t = 1.0; uf = 1.0; ug = x; k = 1.0; z = x * x * x; while( t > DBL_EPSILON ) { uf *= z; k += 1.0; uf /=k; ug *= z; k += 1.0; ug /=k; uf /=k; f += uf; k += 1.0; ug /=k; g += ug; t = fabs(uf/f); } uf = airy_c1 * f; ug = airy_c2 * g; if( (domflg & 1) == 0 ) *ai = uf - ug; if( (domflg & 2) == 0 ) *bi = airy_sqrt3 * (uf + ug); /* the deriviative of ai */ k = 4.0; uf = x * x/2.0; ug = 
z/3.0; f = uf; g = 1.0 + ug; uf /= 3.0; t = 1.0; while( t > DBL_EPSILON ) { uf *= z; ug /=k; k += 1.0; ug *= z; uf /=k; f += uf; k += 1.0; ug /=k; uf /=k; g += ug; k += 1.0; t = fabs(ug/g); } uf = airy_c1 * f; ug = airy_c2 * g; if( (domflg & 4) == 0 ) *aip = uf - ug; if( (domflg & 8) == 0 ) *bip = airy_sqrt3 * (uf + ug); return(0); } pyopencl-2025.1/pyopencl/cl/pyopencl-bessel-j-complex.cl0000644000000000000000000001361214332717401020136 0ustar00/* Evaluate Bessel J function J_v(z) and J_{v+1}(z) with v a nonnegative integer and z anywhere in the complex plane. Copyright (C) Vladimir Rokhlin Copyright (C) 2010-2012 Leslie Greengard and Zydrunas Gimbutas Copyright (C) 2015 Shidong Jiang, Andreas Kloeckner Manually translated from https://github.com/zgimbutas/fmmlib2d/blob/master/src/cdjseval2d.f Originally licensed under GPL, permission to license under MIT granted via email by Vladimir Rokhlin on May 25, 2015 and by Zydrunas Gimbutas on May 17, 2015. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ void bessel_j_complex(int v, cdouble_t z, cdouble_t *j_v, cdouble_t *j_vp1) { int n; int nmax = 10000; int k; int kmax=8; int vscale, vp1scale; double vscaling, vp1scaling; const double small = 2e-1; const double median = 1.0e0; const double upbound = 1e40; const double upbound_inv = 1e-40; double dd; double k_factorial_inv, kv_factorial_inv, kvp1_factorial_inv; cdouble_t z_half, mz_half2, mz_half_2k, z_half_v, z_half_vp1; cdouble_t ima = cdouble_new(0, 1); cdouble_t neg_ima = cdouble_new(0, -1); cdouble_t zinv, ztmp; cdouble_t j_nm1, j_n, j_np1; cdouble_t psi, zsn, zmul, zmulinv; cdouble_t unscaled_j_n, unscaled_j_nm1, unscaled_j_np1; cdouble_t unscaled_j_v, unscaled_j_vp1; cdouble_t scaling; // assert( v >= 0 ); #if 0 if (cdouble_abs(z) < tiny) { if (v == 0) { *j_v = cdouble_new(1, 0); *j_vp1 = cdouble_new(0, 0); } else { *j_v = cdouble_new(0, 0); *j_vp1 = cdouble_new(0, 0); } return; } #endif // {{{ power series for (small z) or (large v and median z) if ( (cdouble_abs(z) < small) || ( (v>12) && (cdouble_abs(z) < median))) { z_half = cdouble_divider(z,2.0); mz_half2 = cdouble_neg(cdouble_mul(z_half, z_half)); z_half_v = cdouble_powr(z_half, v); z_half_vp1 = cdouble_mul(z_half_v, z_half); // compute 1/v! 
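// (Added note: the loop below accumulates kv_factorial_inv = 1/v! term by
// term; together with the powers of -z^2/4 accumulated further down, this
// evaluates the ascending series
//   J_v(z) = (z/2)^v * sum_{k>=0} (-z^2/4)^k / (k! (v+k)!),
// truncated after kmax terms.)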
kv_factorial_inv = 1.0; for ( k = 1; k <= v; k++) { kv_factorial_inv /= k; } kvp1_factorial_inv = kv_factorial_inv / (v+1); k_factorial_inv = 1.0; // compute the power series of bessel j function mz_half_2k = cdouble_new(1.0, 0); *j_v = cdouble_new(0, 0); *j_vp1 = cdouble_new(0, 0); for ( k = 0; k < kmax; k++ ) { *j_v = cdouble_add( *j_v, cdouble_mulr(mz_half_2k, kv_factorial_inv*k_factorial_inv)); *j_vp1 = cdouble_add(*j_vp1, cdouble_mulr(mz_half_2k, kvp1_factorial_inv*k_factorial_inv)); mz_half_2k = cdouble_mul(mz_half_2k, mz_half2); k_factorial_inv /= (k+1); kv_factorial_inv /= (k+v+1); kvp1_factorial_inv /= (k+v+2); } *j_v = cdouble_mul(*j_v, z_half_v ); *j_vp1 = cdouble_mul(*j_vp1, z_half_vp1 ); return; } // }}} // {{{ use recurrence for large z j_nm1 = cdouble_new(0, 0); j_n = cdouble_new(1, 0); n = v; zinv = cdouble_rdivide(1,z); while (true) { j_np1 = cdouble_sub( cdouble_mul(cdouble_rmul(2*n, zinv), j_n), j_nm1); n += 1; j_nm1 = j_n; j_n = j_np1; if (n > nmax) { *j_v = cdouble_new(nan(0x8e55e1u), 0); *j_vp1 = cdouble_new(nan(0x8e55e1u), 0); return; } if (cdouble_abs_squared(j_n) > upbound) break; } // downward recursion, account for rescalings // Record the number of times of the missed rescalings // for j_v and j_vp1. unscaled_j_np1 = cdouble_new(0, 0); unscaled_j_n = cdouble_new(1, 0); // Use normalization condition http://dlmf.nist.gov/10.12#E5 psi = cdouble_new(0, 0); if (cdouble_imag(z) <= 0) zmul = ima; else zmul = neg_ima; zsn = cdouble_powr(zmul, n%4); zmulinv = cdouble_rdivide(1, zmul); vscale = 0; vp1scale = 0; while (n > 0) { ztmp = cdouble_sub( cdouble_mul(cdouble_rmul(2*n, zinv), unscaled_j_n), unscaled_j_np1); unscaled_j_nm1 = ztmp; psi = cdouble_add(psi, cdouble_mul(unscaled_j_n, zsn)); zsn = cdouble_mul(zsn, zmulinv); n -= 1; unscaled_j_np1 = unscaled_j_n; unscaled_j_n = unscaled_j_nm1; if (cdouble_abs_squared(ztmp) > upbound) { unscaled_j_np1 = cdouble_rmul(upbound_inv, unscaled_j_np1); unscaled_j_n = cdouble_rmul(upbound_inv, unscaled_j_n); psi = cdouble_rmul(upbound_inv,psi); if (n < v) vscale++; if (n < v+1) vp1scale++; } if (n == v) unscaled_j_v = unscaled_j_n; if (n == v+1) unscaled_j_vp1 = unscaled_j_n; } psi = cdouble_add(cdouble_rmul(2, psi), unscaled_j_n); if ( cdouble_imag(z) <= 0 ) { scaling = cdouble_divide( cdouble_exp( cdouble_mul(ima,z) ), psi); } else { scaling = cdouble_divide( cdouble_exp( cdouble_mul(neg_ima,z) ), psi); } vscaling = pow(upbound_inv, (double) vscale); vp1scaling = pow(upbound_inv, (double) vp1scale); *j_v = cdouble_mul(unscaled_j_v, cdouble_mulr(scaling, vscaling)); *j_vp1 = cdouble_mul(unscaled_j_vp1, cdouble_mulr(scaling,vp1scaling)); // }}} } // vim: fdm=marker pyopencl-2025.1/pyopencl/cl/pyopencl-bessel-j.cl0000644000000000000000000005535214332717401016500 0ustar00// Pieced together from Boost C++ and Cephes by // Andreas Kloeckner (C) 2012 // // Pieces from: // // Copyright (c) 2006 Xiaogang Zhang, John Maddock // Use, modification and distribution are subject to the // Boost Software License, Version 1.0. (See // http://www.boost.org/LICENSE_1_0.txt) // // Cephes Math Library Release 2.8: June, 2000 // Copyright 1984, 1987, 1989, 1992, 2000 by Stephen L. Moshier // What you see here may be used freely, but it comes with no support or // guarantee. 
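//
// (Added note, not part of the original sources.) This file provides
// bessel_j0(x) and bessel_j1(x) for real double x, as well as
// bessel_jv(v, x) for real, not necessarily integer, order v; bessel_jn is
// #define'd as an alias of bessel_jv. It relies on boost_evaluate_rational
// and cephes_polevl from pyopencl-eval-tbl.cl and on airy() from
// pyopencl-airy.cl. A minimal device-side usage sketch (hypothetical kernel
// name; assumes cl_khr_fp64 double precision is available):
//
//   __kernel void eval_jv(double v, __global const double *x,
//                         __global double *out)
//   {
//       int i = get_global_id(0);
//       out[i] = bessel_jv(v, x[i]);
//   }
//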
#pragma once #include #include typedef double bessel_j_scalar_type; // FIXME: T is really a bad name typedef bessel_j_scalar_type T; // {{{ bessel_j0 __constant const bessel_j_scalar_type bessel_j0_P1[] = { -4.1298668500990866786e+11, 2.7282507878605942706e+10, -6.2140700423540120665e+08, 6.6302997904833794242e+06, -3.6629814655107086448e+04, 1.0344222815443188943e+02, -1.2117036164593528341e-01 }; __constant const bessel_j_scalar_type bessel_j0_Q1[] = { 2.3883787996332290397e+12, 2.6328198300859648632e+10, 1.3985097372263433271e+08, 4.5612696224219938200e+05, 9.3614022392337710626e+02, 1.0, 0.0 }; __constant const bessel_j_scalar_type bessel_j0_P2[] = { -1.8319397969392084011e+03, -1.2254078161378989535e+04, -7.2879702464464618998e+03, 1.0341910641583726701e+04, 1.1725046279757103576e+04, 4.4176707025325087628e+03, 7.4321196680624245801e+02, 4.8591703355916499363e+01 }; __constant const bessel_j_scalar_type bessel_j0_Q2[] = { -3.5783478026152301072e+05, 2.4599102262586308984e+05, -8.4055062591169562211e+04, 1.8680990008359188352e+04, -2.9458766545509337327e+03, 3.3307310774649071172e+02, -2.5258076240801555057e+01, 1.0 }; __constant const bessel_j_scalar_type bessel_j0_PC[] = { 2.2779090197304684302e+04, 4.1345386639580765797e+04, 2.1170523380864944322e+04, 3.4806486443249270347e+03, 1.5376201909008354296e+02, 8.8961548424210455236e-01 }; __constant const bessel_j_scalar_type bessel_j0_QC[] = { 2.2779090197304684318e+04, 4.1370412495510416640e+04, 2.1215350561880115730e+04, 3.5028735138235608207e+03, 1.5711159858080893649e+02, 1.0 }; __constant const bessel_j_scalar_type bessel_j0_PS[] = { -8.9226600200800094098e+01, -1.8591953644342993800e+02, -1.1183429920482737611e+02, -2.2300261666214198472e+01, -1.2441026745835638459e+00, -8.8033303048680751817e-03 }; __constant const bessel_j_scalar_type bessel_j0_QS[] = { 5.7105024128512061905e+03, 1.1951131543434613647e+04, 7.2642780169211018836e+03, 1.4887231232283756582e+03, 9.0593769594993125859e+01, 1.0 }; bessel_j_scalar_type bessel_j0(bessel_j_scalar_type x) { const bessel_j_scalar_type x1 = 2.4048255576957727686e+00, x2 = 5.5200781102863106496e+00, x11 = 6.160e+02, x12 = -1.42444230422723137837e-03, x21 = 1.4130e+03, x22 = 5.46860286310649596604e-04; bessel_j_scalar_type value, factor, r, rc, rs; if (x < 0) { x = -x; // even function } if (x == 0) { return 1; } if (x <= 4) // x in (0, 4] { bessel_j_scalar_type y = x * x; r = boost_evaluate_rational(bessel_j0_P1, bessel_j0_Q1, y); factor = (x + x1) * ((x - x11/256) - x12); value = factor * r; } else if (x <= 8.0) // x in (4, 8] { bessel_j_scalar_type y = 1 - (x * x)/64; r = boost_evaluate_rational(bessel_j0_P2, bessel_j0_Q2, y); factor = (x + x2) * ((x - x21/256) - x22); value = factor * r; } else // x in (8, \infty) { bessel_j_scalar_type y = 8 / x; bessel_j_scalar_type y2 = y * y; bessel_j_scalar_type z = x - 0.25f * M_PI; rc = boost_evaluate_rational(bessel_j0_PC, bessel_j0_QC, y2); rs = boost_evaluate_rational(bessel_j0_PS, bessel_j0_QS, y2); factor = sqrt(2 / (x * M_PI)); value = factor * (rc * cos(z) - y * rs * sin(z)); } return value; } // }}} // {{{ bessel_j1 __constant const bessel_j_scalar_type bessel_j1_P1[] = { -1.4258509801366645672e+11, 6.6781041261492395835e+09, -1.1548696764841276794e+08, 9.8062904098958257677e+05, -4.4615792982775076130e+03, 1.0650724020080236441e+01, -1.0767857011487300348e-02 }; __constant const bessel_j_scalar_type bessel_j1_Q1[] = { 4.1868604460820175290e+12, 4.2091902282580133541e+10, 2.0228375140097033958e+08, 5.9117614494174794095e+05, 
1.0742272239517380498e+03, 1.0, 0.0 }; __constant const bessel_j_scalar_type bessel_j1_P2[] = { -1.7527881995806511112e+16, 1.6608531731299018674e+15, -3.6658018905416665164e+13, 3.5580665670910619166e+11, -1.8113931269860667829e+09, 5.0793266148011179143e+06, -7.5023342220781607561e+03, 4.6179191852758252278e+00 }; __constant const bessel_j_scalar_type bessel_j1_Q2[] = { 1.7253905888447681194e+18, 1.7128800897135812012e+16, 8.4899346165481429307e+13, 2.7622777286244082666e+11, 6.4872502899596389593e+08, 1.1267125065029138050e+06, 1.3886978985861357615e+03, 1.0 }; __constant const bessel_j_scalar_type bessel_j1_PC[] = { -4.4357578167941278571e+06, -9.9422465050776411957e+06, -6.6033732483649391093e+06, -1.5235293511811373833e+06, -1.0982405543459346727e+05, -1.6116166443246101165e+03, 0.0 }; __constant const bessel_j_scalar_type bessel_j1_QC[] = { -4.4357578167941278568e+06, -9.9341243899345856590e+06, -6.5853394797230870728e+06, -1.5118095066341608816e+06, -1.0726385991103820119e+05, -1.4550094401904961825e+03, 1.0 }; __constant const bessel_j_scalar_type bessel_j1_PS[] = { 3.3220913409857223519e+04, 8.5145160675335701966e+04, 6.6178836581270835179e+04, 1.8494262873223866797e+04, 1.7063754290207680021e+03, 3.5265133846636032186e+01, 0.0 }; __constant const bessel_j_scalar_type bessel_j1_QS[] = { 7.0871281941028743574e+05, 1.8194580422439972989e+06, 1.4194606696037208929e+06, 4.0029443582266975117e+05, 3.7890229745772202641e+04, 8.6383677696049909675e+02, 1.0 }; bessel_j_scalar_type bessel_j1(bessel_j_scalar_type x) { const bessel_j_scalar_type x1 = 3.8317059702075123156e+00, x2 = 7.0155866698156187535e+00, x11 = 9.810e+02, x12 = -3.2527979248768438556e-04, x21 = 1.7960e+03, x22 = -3.8330184381246462950e-05; bessel_j_scalar_type value, factor, r, rc, rs, w; w = fabs(x); if (x == 0) { return 0; } if (w <= 4) // w in (0, 4] { bessel_j_scalar_type y = x * x; r = boost_evaluate_rational(bessel_j1_P1, bessel_j1_Q1, y); factor = w * (w + x1) * ((w - x11/256) - x12); value = factor * r; } else if (w <= 8) // w in (4, 8] { bessel_j_scalar_type y = x * x; r = boost_evaluate_rational(bessel_j1_P2, bessel_j1_Q2, y); factor = w * (w + x2) * ((w - x21/256) - x22); value = factor * r; } else // w in (8, \infty) { bessel_j_scalar_type y = 8 / w; bessel_j_scalar_type y2 = y * y; bessel_j_scalar_type z = w - 0.75f * M_PI; rc = boost_evaluate_rational(bessel_j1_PC, bessel_j1_QC, y2); rs = boost_evaluate_rational(bessel_j1_PS, bessel_j1_QS, y2); factor = sqrt(2 / (w * M_PI)); value = factor * (rc * cos(z) - y * rs * sin(z)); } if (x < 0) { value *= -1; // odd function } return value; } // }}} // {{{ bessel_recur /* Reduce the order by backward recurrence. * AMS55 #9.1.27 and 9.1.73. 
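 *
 * (Added note.) The ratio J_n(x)/J_{n-1}(x) is first obtained from the
 * continued fraction (AMS55 #9.1.73)
 *
 *   J_n(x)/J_{n-1}(x) = x / (2n - x^2/(2n+2 - x^2/(2n+4 - ...))),
 *
 * after which the order is walked downward with the three-term recurrence
 * J_{k-1}(x) = (2k/x) J_k(x) - J_{k+1}(x) from AMS55 #9.1.27.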
*/ #define BESSEL_BIG 1.44115188075855872E+17 double bessel_recur(double *n, double x, double *newn, int cancel ) { double pkm2, pkm1, pk, qkm2, qkm1; /* double pkp1; */ double k, ans, qk, xk, yk, r, t, kf; const double big = BESSEL_BIG; int nflag, ctr; /* continued fraction for Jn(x)/Jn-1(x) */ if( *n < 0.0 ) nflag = 1; else nflag = 0; fstart: #if DEBUG printf( "recur: n = %.6e, newn = %.6e, cfrac = ", *n, *newn ); #endif pkm2 = 0.0; qkm2 = 1.0; pkm1 = x; qkm1 = *n + *n; xk = -x * x; yk = qkm1; ans = 1.0; ctr = 0; do { yk += 2.0; pk = pkm1 * yk + pkm2 * xk; qk = qkm1 * yk + qkm2 * xk; pkm2 = pkm1; pkm1 = pk; qkm2 = qkm1; qkm1 = qk; if( qk != 0 ) r = pk/qk; else r = 0.0; if( r != 0 ) { t = fabs( (ans - r)/r ); ans = r; } else t = 1.0; if( ++ctr > 1000 ) { //mtherr( "jv", UNDERFLOW ); pk = nan((uint)24); goto done; } if( t < DBL_EPSILON ) goto done; if( fabs(pk) > big ) { pkm2 /= big; pkm1 /= big; qkm2 /= big; qkm1 /= big; } } while( t > DBL_EPSILON ); done: #if DEBUG printf( "%.6e\n", ans ); #endif /* Change n to n-1 if n < 0 and the continued fraction is small */ if( nflag > 0 ) { if( fabs(ans) < 0.125 ) { nflag = -1; *n = *n - 1.0; goto fstart; } } kf = *newn; /* backward recurrence * 2k * J (x) = --- J (x) - J (x) * k-1 x k k+1 */ pk = 1.0; pkm1 = 1.0/ans; k = *n - 1.0; r = 2 * k; do { pkm2 = (pkm1 * r - pk * x) / x; /* pkp1 = pk; */ pk = pkm1; pkm1 = pkm2; r -= 2.0; /* t = fabs(pkp1) + fabs(pk); if( (k > (kf + 2.5)) && (fabs(pkm1) < 0.25*t) ) { k -= 1.0; t = x*x; pkm2 = ( (r*(r+2.0)-t)*pk - r*x*pkp1 )/t; pkp1 = pk; pk = pkm1; pkm1 = pkm2; r -= 2.0; } */ k -= 1.0; } while( k > (kf + 0.5) ); /* Take the larger of the last two iterates * on the theory that it may have less cancellation error. */ if( cancel ) { if( (kf >= 0.0) && (fabs(pk) > fabs(pkm1)) ) { k += 1.0; pkm2 = pk; } } *newn = k; #if DEBUG printf( "newn %.6e rans %.6e\n", k, pkm2 ); #endif return( pkm2 ); } // }}} // {{{ bessel_jvs #define BESSEL_MAXGAM 171.624376956302725 #define BESSEL_MAXLOG 7.09782712893383996843E2 /* Ascending power series for Jv(x). * AMS55 #9.1.10. 
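 *
 * (Added note.) Concretely, the series summed below is
 *
 *   J_v(x) = (x/2)^v / Gamma(v+1) * sum_{k>=0} (-x^2/4)^k / (k! (v+1)_k),
 *
 * with (v+1)_k the Pochhammer symbol; terms are added until their relative
 * contribution falls below DBL_EPSILON.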
*/ double bessel_jvs(double n, double x) { double t, u, y, z, k; int ex; int sgngam = 1; z = -x * x / 4.0; u = 1.0; y = u; k = 1.0; t = 1.0; while( t > DBL_EPSILON ) { u *= z / (k * (n+k)); y += u; k += 1.0; if( y != 0 ) t = fabs( u/y ); } #if DEBUG printf( "power series=%.5e ", y ); #endif t = frexp( 0.5*x, &ex ); ex = ex * n; if( (ex > -1023) && (ex < 1023) && (n > 0.0) && (n < (BESSEL_MAXGAM-1.0)) ) { t = pow( 0.5*x, n ) / tgamma( n + 1.0 ); #if DEBUG printf( "pow(.5*x, %.4e)/gamma(n+1)=%.5e\n", n, t ); #endif y *= t; } else { #if DEBUG z = n * log(0.5*x); k = lgamma( n+1.0 ); t = z - k; printf( "log pow=%.5e, lgam(%.4e)=%.5e\n", z, n+1.0, k ); #else t = n * log(0.5*x) - lgamma(n + 1.0); #endif if( y < 0 ) { sgngam = -sgngam; y = -y; } t += log(y); #if DEBUG printf( "log y=%.5e\n", log(y) ); #endif if( t < -BESSEL_MAXLOG ) { return( 0.0 ); } if( t > BESSEL_MAXLOG ) { // mtherr( "Jv", OVERFLOW ); return( DBL_MAX); } y = sgngam * exp( t ); } return(y); } // }}} // {{{ bessel_jnt __constant const double bessel_jnt_PF2[] = { -9.0000000000000000000e-2, 8.5714285714285714286e-2 }; __constant const double bessel_jnt_PF3[] = { 1.3671428571428571429e-1, -5.4920634920634920635e-2, -4.4444444444444444444e-3 }; __constant const double bessel_jnt_PF4[] = { 1.3500000000000000000e-3, -1.6036054421768707483e-1, 4.2590187590187590188e-2, 2.7330447330447330447e-3 }; __constant const double bessel_jnt_PG1[] = { -2.4285714285714285714e-1, 1.4285714285714285714e-2 }; __constant const double bessel_jnt_PG2[] = { -9.0000000000000000000e-3, 1.9396825396825396825e-1, -1.1746031746031746032e-2 }; __constant const double bessel_jnt_PG3[] = { 1.9607142857142857143e-2, -1.5983694083694083694e-1, 6.3838383838383838384e-3 }; double bessel_jnt(double n, double x) { double z, zz, z3; double cbn, n23, cbtwo; double ai, aip, bi, bip; /* Airy functions */ double nk, fk, gk, pp, qq; double F[5], G[4]; int k; cbn = cbrt(n); z = (x - n)/cbn; cbtwo = cbrt( 2.0 ); /* Airy function */ zz = -cbtwo * z; airy( zz, &ai, &aip, &bi, &bip ); /* polynomials in expansion */ zz = z * z; z3 = zz * z; F[0] = 1.0; F[1] = -z/5.0; F[2] = cephes_polevl( z3, bessel_jnt_PF2, 1 ) * zz; F[3] = cephes_polevl( z3, bessel_jnt_PF3, 2 ); F[4] = cephes_polevl( z3, bessel_jnt_PF4, 3 ) * z; G[0] = 0.3 * zz; G[1] = cephes_polevl( z3, bessel_jnt_PG1, 1 ); G[2] = cephes_polevl( z3, bessel_jnt_PG2, 2 ) * z; G[3] = cephes_polevl( z3, bessel_jnt_PG3, 2 ) * zz; #if DEBUG for( k=0; k<=4; k++ ) printf( "F[%d] = %.5E\n", k, F[k] ); for( k=0; k<=3; k++ ) printf( "G[%d] = %.5E\n", k, G[k] ); #endif pp = 0.0; qq = 0.0; nk = 1.0; n23 = cbrt( n * n ); for( k=0; k<=4; k++ ) { fk = F[k]*nk; pp += fk; if( k != 4 ) { gk = G[k]*nk; qq += gk; } #if DEBUG printf("fk[%d] %.5E, gk[%d] %.5E\n", k, fk, k, gk ); #endif nk /= n23; } fk = cbtwo * ai * pp/cbn + cbrt(4.0) * aip * qq/n; return(fk); } // }}} // {{{ bessel_jnx __constant const double bessel_jnx_lambda[] = { 1.0, 1.041666666666666666666667E-1, 8.355034722222222222222222E-2, 1.282265745563271604938272E-1, 2.918490264641404642489712E-1, 8.816272674437576524187671E-1, 3.321408281862767544702647E+0, 1.499576298686255465867237E+1, 7.892301301158651813848139E+1, 4.744515388682643231611949E+2, 3.207490090890661934704328E+3 }; __constant const double bessel_jnx_mu[] = { 1.0, -1.458333333333333333333333E-1, -9.874131944444444444444444E-2, -1.433120539158950617283951E-1, -3.172272026784135480967078E-1, -9.424291479571202491373028E-1, -3.511203040826354261542798E+0, -1.572726362036804512982712E+1, -8.228143909718594444224656E+1, 
-4.923553705236705240352022E+2, -3.316218568547972508762102E+3 }; __constant const double bessel_jnx_P1[] = { -2.083333333333333333333333E-1, 1.250000000000000000000000E-1 }; __constant const double bessel_jnx_P2[] = { 3.342013888888888888888889E-1, -4.010416666666666666666667E-1, 7.031250000000000000000000E-2 }; __constant const double bessel_jnx_P3[] = { -1.025812596450617283950617E+0, 1.846462673611111111111111E+0, -8.912109375000000000000000E-1, 7.324218750000000000000000E-2 }; __constant const double bessel_jnx_P4[] = { 4.669584423426247427983539E+0, -1.120700261622299382716049E+1, 8.789123535156250000000000E+0, -2.364086914062500000000000E+0, 1.121520996093750000000000E-1 }; __constant const double bessel_jnx_P5[] = { -2.8212072558200244877E1, 8.4636217674600734632E1, -9.1818241543240017361E1, 4.2534998745388454861E1, -7.3687943594796316964E0, 2.27108001708984375E-1 }; __constant const double bessel_jnx_P6[] = { 2.1257013003921712286E2, -7.6525246814118164230E2, 1.0599904525279998779E3, -6.9957962737613254123E2, 2.1819051174421159048E2, -2.6491430486951555525E1, 5.7250142097473144531E-1 }; __constant const double bessel_jnx_P7[] = { -1.9194576623184069963E3, 8.0617221817373093845E3, -1.3586550006434137439E4, 1.1655393336864533248E4, -5.3056469786134031084E3, 1.2009029132163524628E3, -1.0809091978839465550E2, 1.7277275025844573975E0 }; double bessel_jnx(double n, double x) { double zeta, sqz, zz, zp, np; double cbn, n23, t, z, sz; double pp, qq, z32i, zzi; double ak, bk, akl, bkl; int sign, doa, dob, nflg, k, s, tk, tkp1, m; double u[8]; double ai, aip, bi, bip; /* Test for x very close to n. * Use expansion for transition region if so. */ cbn = cbrt(n); z = (x - n)/cbn; if( fabs(z) <= 0.7 ) return( bessel_jnt(n,x) ); z = x/n; zz = 1.0 - z*z; if( zz == 0.0 ) return(0.0); if( zz > 0.0 ) { sz = sqrt( zz ); t = 1.5 * (log( (1.0+sz)/z ) - sz ); /* zeta ** 3/2 */ zeta = cbrt( t * t ); nflg = 1; } else { sz = sqrt(-zz); t = 1.5 * (sz - acos(1.0/z)); zeta = -cbrt( t * t ); nflg = -1; } z32i = fabs(1.0/t); sqz = cbrt(t); /* Airy function */ n23 = cbrt( n * n ); t = n23 * zeta; #if DEBUG printf("zeta %.5E, Airy(%.5E)\n", zeta, t ); #endif airy( t, &ai, &aip, &bi, &bip ); /* polynomials in expansion */ u[0] = 1.0; zzi = 1.0/zz; u[1] = cephes_polevl( zzi, bessel_jnx_P1, 1 )/sz; u[2] = cephes_polevl( zzi, bessel_jnx_P2, 2 )/zz; u[3] = cephes_polevl( zzi, bessel_jnx_P3, 3 )/(sz*zz); pp = zz*zz; u[4] = cephes_polevl( zzi, bessel_jnx_P4, 4 )/pp; u[5] = cephes_polevl( zzi, bessel_jnx_P5, 5 )/(pp*sz); pp *= zz; u[6] = cephes_polevl( zzi, bessel_jnx_P6, 6 )/pp; u[7] = cephes_polevl( zzi, bessel_jnx_P7, 7 )/(pp*sz); #if DEBUG for( k=0; k<=7; k++ ) printf( "u[%d] = %.5E\n", k, u[k] ); #endif pp = 0.0; qq = 0.0; np = 1.0; /* flags to stop when terms get larger */ doa = 1; dob = 1; akl = DBL_MAX; bkl = DBL_MAX; for( k=0; k<=3; k++ ) { tk = 2 * k; tkp1 = tk + 1; zp = 1.0; ak = 0.0; bk = 0.0; for( s=0; s<=tk; s++ ) { if( doa ) { if( (s & 3) > 1 ) sign = nflg; else sign = 1; ak += sign * bessel_jnx_mu[s] * zp * u[tk-s]; } if( dob ) { m = tkp1 - s; if( ((m+1) & 3) > 1 ) sign = nflg; else sign = 1; bk += sign * bessel_jnx_lambda[s] * zp * u[m]; } zp *= z32i; } if( doa ) { ak *= np; t = fabs(ak); if( t < akl ) { akl = t; pp += ak; } else doa = 0; } if( dob ) { bk += bessel_jnx_lambda[tkp1] * zp * u[0]; bk *= -np/sqz; t = fabs(bk); if( t < bkl ) { bkl = t; qq += bk; } else dob = 0; } #if DEBUG printf("a[%d] %.5E, b[%d] %.5E\n", k, ak, k, bk ); #endif if( np < DBL_EPSILON ) break; np /= n*n; } /* normalizing 
factor ( 4*zeta/(1 - z**2) )**1/4 */ t = 4.0 * zeta/zz; t = sqrt( sqrt(t) ); t *= ai*pp/cbrt(n) + aip*qq/(n23*n); return(t); } // }}} // {{{ bessel_hankel /* Hankel's asymptotic expansion * for large x. * AMS55 #9.2.5. */ double bessel_hankel( double n, double x ) { double t, u, z, k, sign, conv; double p, q, j, m, pp, qq; int flag; m = 4.0*n*n; j = 1.0; z = 8.0 * x; k = 1.0; p = 1.0; u = (m - 1.0)/z; q = u; sign = 1.0; conv = 1.0; flag = 0; t = 1.0; pp = 1.0e38; qq = 1.0e38; while( t > DBL_EPSILON ) { k += 2.0; j += 1.0; sign = -sign; u *= (m - k * k)/(j * z); p += sign * u; k += 2.0; j += 1.0; u *= (m - k * k)/(j * z); q += sign * u; t = fabs(u/p); if( t < conv ) { conv = t; qq = q; pp = p; flag = 1; } /* stop if the terms start getting larger */ if( (flag != 0) && (t > conv) ) { #if DEBUG printf( "Hankel: convergence to %.4E\n", conv ); #endif goto hank1; } } hank1: u = x - (0.5*n + 0.25) * M_PI; t = sqrt( 2.0/(M_PI*x) ) * ( pp * cos(u) - qq * sin(u) ); #if DEBUG printf( "hank: %.6e\n", t ); #endif return( t ); } // }}} // {{{ bessel_jv // SciPy says jn has no advantage over jv, so alias the two. #define bessel_jn bessel_jv double bessel_jv(double n, double x) { double k, q, t, y, an; int i, sign, nint; nint = 0; /* Flag for integer n */ sign = 1; /* Flag for sign inversion */ an = fabs( n ); y = floor( an ); if( y == an ) { nint = 1; i = an - 16384.0 * floor( an/16384.0 ); if( n < 0.0 ) { if( i & 1 ) sign = -sign; n = an; } if( x < 0.0 ) { if( i & 1 ) sign = -sign; x = -x; } if( n == 0.0 ) return( bessel_j0(x) ); if( n == 1.0 ) return( sign * bessel_j1(x) ); } if( (x < 0.0) && (y != an) ) { // mtherr( "Jv", DOMAIN ); // y = 0.0; y = nan((uint)22); goto done; } y = fabs(x); if( y < DBL_EPSILON ) goto underf; k = 3.6 * sqrt(y); t = 3.6 * sqrt(an); if( (y < t) && (an > 21.0) ) return( sign * bessel_jvs(n,x) ); if( (an < k) && (y > 21.0) ) return( sign * bessel_hankel(n,x) ); if( an < 500.0 ) { /* Note: if x is too large, the continued * fraction will fail; but then the * Hankel expansion can be used. */ if( nint != 0 ) { k = 0.0; q = bessel_recur( &n, x, &k, 1 ); if( k == 0.0 ) { y = bessel_j0(x)/q; goto done; } if( k == 1.0 ) { y = bessel_j1(x)/q; goto done; } } if( an > 2.0 * y ) goto rlarger; if( (n >= 0.0) && (n < 20.0) && (y > 6.0) && (y < 20.0) ) { /* Recur backwards from a larger value of n */ rlarger: k = n; y = y + an + 1.0; if( y < 30.0 ) y = 30.0; y = n + floor(y-n); q = bessel_recur( &y, x, &k, 0 ); y = bessel_jvs(y,x) * q; goto done; } if( k <= 30.0 ) { k = 2.0; } else if( k < 90.0 ) { k = (3*k)/4; } if( an > (k + 3.0) ) { if( n < 0.0 ) k = -k; q = n - floor(n); k = floor(k) + q; if( n > 0.0 ) q = bessel_recur( &n, x, &k, 1 ); else { t = k; k = n; q = bessel_recur( &t, x, &k, 1 ); k = t; } if( q == 0.0 ) { underf: y = 0.0; goto done; } } else { k = n; q = 1.0; } /* boundary between convergence of * power series and Hankel expansion */ y = fabs(k); if( y < 26.0 ) t = (0.0083*y + 0.09)*y + 12.9; else t = 0.9 * y; if( x > t ) y = bessel_hankel(k,x); else y = bessel_jvs(k,x); #if DEBUG printf( "y = %.16e, recur q = %.16e\n", y, q ); #endif if( n > 0.0 ) y /= q; else y *= q; } else { /* For large n, use the uniform expansion * or the transitional expansion. * But if x is of the order of n**2, * these may blow up, whereas the * Hankel expansion will then work. 
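 *
 * (Added note.) Concretely, the code below switches to the Hankel
 * expansion once x/n^2 > 0.3 and otherwise uses the uniform expansion
 * bessel_jnx, which in turn falls back to the transitional expansion
 * bessel_jnt when x is close to n.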
*/ if( n < 0.0 ) { //mtherr( "Jv", TLOSS ); //y = 0.0; y = nan((uint)23); goto done; } t = x/n; t /= n; if( t > 0.3 ) y = bessel_hankel(n,x); else y = bessel_jnx(n,x); } done: return( sign * y); } // }}} // vim: fdm=marker pyopencl-2025.1/pyopencl/cl/pyopencl-bessel-y.cl0000644000000000000000000003001114332717401016500 0ustar00// Pieced together from Boost C++ and Cephes by // Andreas Kloeckner (C) 2012 // // Pieces from: // // Copyright (c) 2006 Xiaogang Zhang, John Maddock // Use, modification and distribution are subject to the // Boost Software License, Version 1.0. (See // http://www.boost.org/LICENSE_1_0.txt) // // Cephes Math Library Release 2.8: June, 2000 // Copyright 1984, 1987, 1989, 1992, 2000 by Stephen L. Moshier // What you see here may be used freely, but it comes with no support or // guarantee. #pragma once #include #include typedef double bessel_y_scalar_type; // {{{ bessel_y0 __constant const bessel_y_scalar_type bessel_y0_P1[] = { 1.0723538782003176831e+11, -8.3716255451260504098e+09, 2.0422274357376619816e+08, -2.1287548474401797963e+06, 1.0102532948020907590e+04, -1.8402381979244993524e+01, }; __constant const bessel_y_scalar_type bessel_y0_Q1[] = { 5.8873865738997033405e+11, 8.1617187777290363573e+09, 5.5662956624278251596e+07, 2.3889393209447253406e+05, 6.6475986689240190091e+02, 1.0, }; __constant const bessel_y_scalar_type bessel_y0_P2[] = { -2.2213976967566192242e+13, -5.5107435206722644429e+11, 4.3600098638603061642e+10, -6.9590439394619619534e+08, 4.6905288611678631510e+06, -1.4566865832663635920e+04, 1.7427031242901594547e+01, }; __constant const bessel_y_scalar_type bessel_y0_Q2[] = { 4.3386146580707264428e+14, 5.4266824419412347550e+12, 3.4015103849971240096e+10, 1.3960202770986831075e+08, 4.0669982352539552018e+05, 8.3030857612070288823e+02, 1.0, }; __constant const bessel_y_scalar_type bessel_y0_P3[] = { -8.0728726905150210443e+15, 6.7016641869173237784e+14, -1.2829912364088687306e+11, -1.9363051266772083678e+11, 2.1958827170518100757e+09, -1.0085539923498211426e+07, 2.1363534169313901632e+04, -1.7439661319197499338e+01, }; __constant const bessel_y_scalar_type bessel_y0_Q3[] = { 3.4563724628846457519e+17, 3.9272425569640309819e+15, 2.2598377924042897629e+13, 8.6926121104209825246e+10, 2.4727219475672302327e+08, 5.3924739209768057030e+05, 8.7903362168128450017e+02, 1.0, }; __constant const bessel_y_scalar_type bessel_y0_PC[] = { 2.2779090197304684302e+04, 4.1345386639580765797e+04, 2.1170523380864944322e+04, 3.4806486443249270347e+03, 1.5376201909008354296e+02, 8.8961548424210455236e-01, }; __constant const bessel_y_scalar_type bessel_y0_QC[] = { 2.2779090197304684318e+04, 4.1370412495510416640e+04, 2.1215350561880115730e+04, 3.5028735138235608207e+03, 1.5711159858080893649e+02, 1.0, }; __constant const bessel_y_scalar_type bessel_y0_PS[] = { -8.9226600200800094098e+01, -1.8591953644342993800e+02, -1.1183429920482737611e+02, -2.2300261666214198472e+01, -1.2441026745835638459e+00, -8.8033303048680751817e-03, }; __constant const bessel_y_scalar_type bessel_y0_QS[] = { 5.7105024128512061905e+03, 1.1951131543434613647e+04, 7.2642780169211018836e+03, 1.4887231232283756582e+03, 9.0593769594993125859e+01, 1.0, }; bessel_y_scalar_type bessel_y0(bessel_y_scalar_type x) { const bessel_y_scalar_type x1 = 8.9357696627916752158e-01, x2 = 3.9576784193148578684e+00, x3 = 7.0860510603017726976e+00, x11 = 2.280e+02, x12 = 2.9519662791675215849e-03, x21 = 1.0130e+03, x22 = 6.4716931485786837568e-04, x31 = 1.8140e+03, x32 = 1.1356030177269762362e-04; bessel_y_scalar_type 
value, factor, r, rc, rs; if (x < 0) { //return policies::raise_domain_error(function, // "Got x = %1% but x must be non-negative, complex result not supported.", x, pol); return nan((uint)22); } if (x == 0) { return -DBL_MAX; } if (x <= 3) // x in (0, 3] { bessel_y_scalar_type y = x * x; bessel_y_scalar_type z = 2 * log(x/x1) * bessel_j0(x) / M_PI; r = boost_evaluate_rational(bessel_y0_P1, bessel_y0_Q1, y); factor = (x + x1) * ((x - x11/256) - x12); value = z + factor * r; } else if (x <= 5.5f) // x in (3, 5.5] { bessel_y_scalar_type y = x * x; bessel_y_scalar_type z = 2 * log(x/x2) * bessel_j0(x) / M_PI; r = boost_evaluate_rational(bessel_y0_P2, bessel_y0_Q2, y); factor = (x + x2) * ((x - x21/256) - x22); value = z + factor * r; } else if (x <= 8) // x in (5.5, 8] { bessel_y_scalar_type y = x * x; bessel_y_scalar_type z = 2 * log(x/x3) * bessel_j0(x) / M_PI; r = boost_evaluate_rational(bessel_y0_P3, bessel_y0_Q3, y); factor = (x + x3) * ((x - x31/256) - x32); value = z + factor * r; } else // x in (8, \infty) { bessel_y_scalar_type y = 8 / x; bessel_y_scalar_type y2 = y * y; bessel_y_scalar_type z = x - 0.25f * M_PI; rc = boost_evaluate_rational(bessel_y0_PC, bessel_y0_QC, y2); rs = boost_evaluate_rational(bessel_y0_PS, bessel_y0_QS, y2); factor = sqrt(2 / (x * M_PI)); value = factor * (rc * sin(z) + y * rs * cos(z)); } return value; } // }}} // {{{ bessel_y1 __constant const bessel_y_scalar_type bessel_y1_P1[] = { 4.0535726612579544093e+13, 5.4708611716525426053e+12, -3.7595974497819597599e+11, 7.2144548214502560419e+09, -5.9157479997408395984e+07, 2.2157953222280260820e+05, -3.1714424660046133456e+02, }; __constant const bessel_y_scalar_type bessel_y1_Q1[] = { 3.0737873921079286084e+14, 4.1272286200406461981e+12, 2.7800352738690585613e+10, 1.2250435122182963220e+08, 3.8136470753052572164e+05, 8.2079908168393867438e+02, 1.0, }; __constant const bessel_y_scalar_type bessel_y1_P2[] = { 1.1514276357909013326e+19, -5.6808094574724204577e+18, -2.3638408497043134724e+16, 4.0686275289804744814e+15, -5.9530713129741981618e+13, 3.7453673962438488783e+11, -1.1957961912070617006e+09, 1.9153806858264202986e+06, -1.2337180442012953128e+03, }; __constant const bessel_y_scalar_type bessel_y1_Q2[] = { 5.3321844313316185697e+20, 5.6968198822857178911e+18, 3.0837179548112881950e+16, 1.1187010065856971027e+14, 3.0221766852960403645e+11, 6.3550318087088919566e+08, 1.0453748201934079734e+06, 1.2855164849321609336e+03, 1.0, }; __constant const bessel_y_scalar_type bessel_y1_PC[] = { -4.4357578167941278571e+06, -9.9422465050776411957e+06, -6.6033732483649391093e+06, -1.5235293511811373833e+06, -1.0982405543459346727e+05, -1.6116166443246101165e+03, 0.0, }; __constant const bessel_y_scalar_type bessel_y1_QC[] = { -4.4357578167941278568e+06, -9.9341243899345856590e+06, -6.5853394797230870728e+06, -1.5118095066341608816e+06, -1.0726385991103820119e+05, -1.4550094401904961825e+03, 1.0, }; __constant const bessel_y_scalar_type bessel_y1_PS[] = { 3.3220913409857223519e+04, 8.5145160675335701966e+04, 6.6178836581270835179e+04, 1.8494262873223866797e+04, 1.7063754290207680021e+03, 3.5265133846636032186e+01, 0.0, }; __constant const bessel_y_scalar_type bessel_y1_QS[] = { 7.0871281941028743574e+05, 1.8194580422439972989e+06, 1.4194606696037208929e+06, 4.0029443582266975117e+05, 3.7890229745772202641e+04, 8.6383677696049909675e+02, 1.0, }; bessel_y_scalar_type bessel_y1(bessel_y_scalar_type x) { const bessel_y_scalar_type x1 = 2.1971413260310170351e+00, x2 = 5.4296810407941351328e+00, x11 = 5.620e+02, x12 = 
1.8288260310170351490e-03, x21 = 1.3900e+03, x22 = -6.4592058648672279948e-06 ; bessel_y_scalar_type value, factor, r, rc, rs; if (x <= 0) { // domain error return nan((uint)22); } if (x <= 4) // x in (0, 4] { bessel_y_scalar_type y = x * x; bessel_y_scalar_type z = 2 * log(x/x1) * bessel_j1(x) / M_PI; r = boost_evaluate_rational(bessel_y1_P1, bessel_y1_Q1, y); factor = (x + x1) * ((x - x11/256) - x12) / x; value = z + factor * r; } else if (x <= 8) // x in (4, 8] { bessel_y_scalar_type y = x * x; bessel_y_scalar_type z = 2 * log(x/x2) * bessel_j1(x) / M_PI; r = boost_evaluate_rational(bessel_y1_P2, bessel_y1_Q2, y); factor = (x + x2) * ((x - x21/256) - x22) / x; value = z + factor * r; } else // x in (8, \infty) { bessel_y_scalar_type y = 8 / x; bessel_y_scalar_type y2 = y * y; bessel_y_scalar_type z = x - 0.75f * M_PI; rc = boost_evaluate_rational(bessel_y1_PC, bessel_y1_QC, y2); rs = boost_evaluate_rational(bessel_y1_PS, bessel_y1_QS, y2); factor = sqrt(2 / (x * M_PI)); value = factor * (rc * sin(z) + y * rs * cos(z)); } return value; } // }}} // {{{ bessel_yn bessel_y_scalar_type bessel_yn_small_z(int n, bessel_y_scalar_type z, bessel_y_scalar_type* scale) { // // See http://functions.wolfram.com/Bessel-TypeFunctions/BesselY/06/01/04/01/02/ // // Note that when called we assume that x < epsilon and n is a positive integer. // // BOOST_ASSERT(n >= 0); // BOOST_ASSERT((z < policies::get_epsilon())); if(n == 0) { return (2 / M_PI) * (log(z / 2) + M_E); } else if(n == 1) { return (z / M_PI) * log(z / 2) - 2 / (M_PI * z) - (z / (2 * M_PI)) * (1 - 2 * M_E); } else if(n == 2) { return (z * z) / (4 * M_PI) * log(z / 2) - (4 / (M_PI * z * z)) - ((z * z) / (8 * M_PI)) * (3./2 - 2 * M_E); } else { bessel_y_scalar_type p = pow(z / 2, (bessel_y_scalar_type) n); bessel_y_scalar_type result = -((tgamma((bessel_y_scalar_type) n) / M_PI)); if(p * DBL_MAX < result) { bessel_y_scalar_type div = DBL_MAX / 8; result /= div; *scale /= div; if(p * DBL_MAX < result) { return -DBL_MAX; } } return result / p; } } bessel_y_scalar_type bessel_yn(int n, bessel_y_scalar_type x) { //BOOST_MATH_STD_USING bessel_y_scalar_type value, factor, current, prev; //using namespace boost::math::tools; if ((x == 0) && (n == 0)) { return -DBL_MAX; } if (x <= 0) { //return policies::raise_domain_error(function, //"Got x = %1%, but x must be > 0, complex result not supported.", x, pol); return nan((uint)22); } // // Reflection comes first: // if (n < 0) { factor = (n & 0x1) ? -1 : 1; // Y_{-n}(z) = (-1)^n Y_n(z) n = -n; } else { factor = 1; } if(x < DBL_EPSILON) { bessel_y_scalar_type scale = 1; value = bessel_yn_small_z(n, x, &scale); if(DBL_MAX * fabs(scale) < fabs(value)) return copysign((bessel_y_scalar_type) 1, scale) * copysign((bessel_y_scalar_type) 1, value) * DBL_MAX; value /= scale; } else if (n == 0) { value = bessel_y0(x); } else if (n == 1) { value = factor * bessel_y1(x); } else { prev = bessel_y0(x); current = bessel_y1(x); int k = 1; // BOOST_ASSERT(k < n); do { bessel_y_scalar_type fact = 2 * k / x; if((DBL_MAX - fabs(prev)) / fact < fabs(current)) { prev /= current; factor /= current; current = 1; } value = fact * current - prev; prev = current; current = value; ++k; } while(k < n); if(fabs(DBL_MAX * factor) < fabs(value)) return sign(value) * sign(value) * DBL_MAX; value /= factor; } return value; } // }}} // vim: fdm=marker pyopencl-2025.1/pyopencl/cl/pyopencl-complex.h0000644000000000000000000002054014332717401016263 0ustar00/* * Copyright (c) 1999 * Silicon Graphics Computer Systems, Inc. 
* * Copyright (c) 1999 * Boris Fomitchev * * Copyright (c) 2012 * Andreas Kloeckner * * This material is provided "as is", with absolutely no warranty expressed * or implied. Any use is at your own risk. * * Permission to use or copy this software for any purpose is hereby granted * without fee, provided the above notices are retained on all copies. * Permission to modify the code and to distribute modified code is granted, * provided the above notices are retained, and a notice that the code was * modified is included with the above copyright notice. * */ // This file is available for inclusion in pyopencl kernels and provides // complex types 'cfloat_t' and 'cdouble_t', along with a number of special // functions as visible below, e.g. cdouble_log(z). // // Under the hood, the complex types are simply float2 and double2. // Note that native (operator-based) addition (float + float2) and // multiplication (float2*float1) is defined for these types, // but do not match the rules of complex arithmetic. #pragma once #define PYOPENCL_DECLARE_COMPLEX_TYPE_INT(REAL_TP, REAL_3LTR, TPROOT, TP) \ \ inline REAL_TP TPROOT##_real(TP a) { return a.real; } \ inline REAL_TP TPROOT##_imag(TP a) { return a.imag; } \ inline REAL_TP TPROOT##_abs(TP a) { return hypot(a.real, a.imag); } \ inline REAL_TP TPROOT##_abs_squared(TP a) { return a.real * a.real + a.imag * a.imag; } \ \ inline TP TPROOT##_new(REAL_TP real, REAL_TP imag) \ { \ TP result; \ result.real = real; \ result.imag = imag; \ return result; \ } \ \ inline TP TPROOT##_fromreal(REAL_TP real) \ { \ TP result; \ result.real = real; \ result.imag = 0; \ return result; \ } \ \ \ inline TP TPROOT##_neg(TP a) { return TPROOT##_new(-a.real, -a.imag); } \ inline TP TPROOT##_conj(TP a) { return TPROOT##_new(a.real, -a.imag); } \ \ inline TP TPROOT##_add(TP a, TP b) \ { \ return TPROOT##_new(a.real + b.real, a.imag + b.imag); \ ; \ } \ inline TP TPROOT##_addr(TP a, REAL_TP b) \ { \ return TPROOT##_new(b+a.real, a.imag); \ } \ inline TP TPROOT##_radd(REAL_TP a, TP b) \ { \ return TPROOT##_new(a+b.real, b.imag); \ } \ \ inline TP TPROOT##_sub(TP a, TP b) \ { \ return TPROOT##_new(a.real - b.real, a.imag - b.imag); \ ; \ } \ \ inline TP TPROOT##_fma(TP a, TP b, TP c) \ { \ return TPROOT##_new( \ fma(a.real, b.real, c.real) - a.imag*b.imag, \ fma(a.imag, b.real, fma(a.real, b.imag, c.imag))); \ } \ \ inline TP TPROOT##_mul(TP a, TP b) \ { \ return TPROOT##_new( \ a.real*b.real - a.imag*b.imag, \ a.real*b.imag + a.imag*b.real); \ } \ \ inline TP TPROOT##_mulr(TP a, REAL_TP b) \ { \ return TPROOT##_new(a.real*b, a.imag*b); \ } \ \ inline TP TPROOT##_rmul(REAL_TP a, TP b) \ { \ return TPROOT##_new(a*b.real, a*b.imag); \ } \ \ inline TP TPROOT##_rdivide(REAL_TP z1, TP z2) \ { \ if (fabs(z2.real) <= fabs(z2.imag)) { \ REAL_TP ratio = z2.real / z2.imag; \ REAL_TP denom = z2.imag * (1 + ratio * ratio); \ return TPROOT##_new((z1 * ratio) / denom, - z1 / denom); \ } \ else { \ REAL_TP ratio = z2.imag / z2.real; \ REAL_TP denom = z2.real * (1 + ratio * ratio); \ return TPROOT##_new(z1 / denom, - (z1 * ratio) / denom); \ } \ } \ \ inline TP TPROOT##_divide(TP z1, TP z2) \ { \ REAL_TP ratio, denom, a, b, c, d; \ \ if (fabs(z2.real) <= fabs(z2.imag)) { \ ratio = z2.real / z2.imag; \ denom = z2.imag; \ a = z1.imag; \ b = z1.real; \ c = -z1.real; \ d = z1.imag; \ } \ else { \ ratio = z2.imag / z2.real; \ denom = z2.real; \ a = z1.real; \ b = z1.imag; \ c = z1.imag; \ d = -z1.real; \ } \ denom *= (1 + ratio * ratio); \ return TPROOT##_new( \ (a + b * ratio) / denom, \ (c + d 
* ratio) / denom); \ } \ \ inline TP TPROOT##_divider(TP a, REAL_TP b) \ { \ return TPROOT##_new(a.real/b, a.imag/b); \ } \ \ inline TP TPROOT##_pow(TP a, TP b) \ { \ REAL_TP logr = log(hypot(a.real, a.imag)); \ REAL_TP logi = atan2(a.imag, a.real); \ REAL_TP x = exp(logr * b.real - logi * b.imag); \ REAL_TP y = logr * b.imag + logi * b.real; \ \ REAL_TP cosy; \ REAL_TP siny = sincos(y, &cosy); \ return TPROOT##_new(x*cosy, x*siny); \ } \ \ inline TP TPROOT##_powr(TP a, REAL_TP b) \ { \ REAL_TP logr = log(hypot(a.real, a.imag)); \ REAL_TP logi = atan2(a.imag, a.real); \ REAL_TP x = exp(logr * b); \ REAL_TP y = logi * b; \ \ REAL_TP cosy; \ REAL_TP siny = sincos(y, &cosy); \ \ return TPROOT##_new(x * cosy, x*siny); \ } \ \ inline TP TPROOT##_rpow(REAL_TP a, TP b) \ { \ REAL_TP logr = log(a); \ REAL_TP x = exp(logr * b.real); \ REAL_TP y = logr * b.imag; \ \ REAL_TP cosy; \ REAL_TP siny = sincos(y, &cosy); \ return TPROOT##_new(x * cosy, x * siny); \ } \ \ inline TP TPROOT##_sqrt(TP a) \ { \ REAL_TP re = a.real; \ REAL_TP im = a.imag; \ REAL_TP mag = hypot(re, im); \ TP result; \ \ if (mag == 0.f) { \ result.real = result.imag = 0.f; \ } else if (re > 0.f) { \ result.real = sqrt(0.5f * (mag + re)); \ result.imag = im/result.real/2.f; \ } else { \ result.imag = sqrt(0.5f * (mag - re)); \ if (im < 0.f) \ result.imag = - result.imag; \ result.real = im/result.imag/2.f; \ } \ return result; \ } \ \ inline TP TPROOT##_exp(TP a) \ { \ REAL_TP expr = exp(a.real); \ REAL_TP cosi; \ REAL_TP sini = sincos(a.imag, &cosi); \ return TPROOT##_new(expr * cosi, expr * sini); \ } \ \ inline TP TPROOT##_log(TP a) \ { return TPROOT##_new(log(hypot(a.real, a.imag)), atan2(a.imag, a.real)); } \ \ inline TP TPROOT##_sin(TP a) \ { \ REAL_TP cosr; \ REAL_TP sinr = sincos(a.real, &cosr); \ return TPROOT##_new(sinr*cosh(a.imag), cosr*sinh(a.imag)); \ } \ \ inline TP TPROOT##_cos(TP a) \ { \ REAL_TP cosr; \ REAL_TP sinr = sincos(a.real, &cosr); \ return TPROOT##_new(cosr*cosh(a.imag), -sinr*sinh(a.imag)); \ } \ \ inline TP TPROOT##_tan(TP a) \ { \ REAL_TP re2 = 2.f * a.real; \ REAL_TP im2 = 2.f * a.imag; \ \ const REAL_TP limit = log(REAL_3LTR##_MAX); \ \ if (fabs(im2) > limit) \ return TPROOT##_new(0.f, (im2 > 0 ? 1.f : -1.f)); \ else \ { \ REAL_TP den = cos(re2) + cosh(im2); \ return TPROOT##_new(sin(re2) / den, sinh(im2) / den); \ } \ } \ \ inline TP TPROOT##_sinh(TP a) \ { \ REAL_TP cosi; \ REAL_TP sini = sincos(a.imag, &cosi); \ return TPROOT##_new(sinh(a.real)*cosi, cosh(a.real)*sini); \ } \ \ inline TP TPROOT##_cosh(TP a) \ { \ REAL_TP cosi; \ REAL_TP sini = sincos(a.imag, &cosi); \ return TPROOT##_new(cosh(a.real)*cosi, sinh(a.real)*sini); \ } \ \ inline TP TPROOT##_tanh(TP a) \ { \ REAL_TP re2 = 2.f * a.real; \ REAL_TP im2 = 2.f * a.imag; \ \ const REAL_TP limit = log(REAL_3LTR##_MAX); \ \ if (fabs(re2) > limit) \ return TPROOT##_new((re2 > 0 ? 
1.f : -1.f), 0.f); \ else \ { \ REAL_TP den = cosh(re2) + cos(im2); \ return TPROOT##_new(sinh(re2) / den, sin(im2) / den); \ } \ } \ // This is undocumented and may disappear at any time #if PYOPENCL_COMPLEX_ENABLE_EXTENDED_ALIGNMENT #define PYOPENCL_COMPLEX_ALIGNMENT(TYPE) 2*sizeof(TYPE) #else #define PYOPENCL_COMPLEX_ALIGNMENT(TYPE) sizeof(TYPE) #endif #define PYOPENCL_DECLARE_COMPLEX_TYPE(BASE, BASE_3LTR) \ typedef union \ { \ struct { BASE x, y; } \ __attribute__ ((aligned (PYOPENCL_COMPLEX_ALIGNMENT(BASE)))); \ struct { BASE real, imag; } \ __attribute__ ((aligned (PYOPENCL_COMPLEX_ALIGNMENT(BASE)))); \ } c##BASE##_t; \ \ PYOPENCL_DECLARE_COMPLEX_TYPE_INT(BASE, BASE_3LTR, c##BASE, c##BASE##_t) PYOPENCL_DECLARE_COMPLEX_TYPE(float, FLT); #define cfloat_cast(a) cfloat_new((a).real, (a).imag) #ifdef PYOPENCL_DEFINE_CDOUBLE PYOPENCL_DECLARE_COMPLEX_TYPE(double, DBL); #define cdouble_cast(a) cdouble_new((a).real, (a).imag) #endif #undef PYOPENCL_COMPLEX_ALIGNMENT pyopencl-2025.1/pyopencl/cl/pyopencl-eval-tbl.cl0000644000000000000000000000507014332717401016472 0ustar00// Pieced together from Boost C++ and Cephes by // Andreas Kloeckner (C) 2012 // // Pieces from: // // Copyright (c) 2006 Xiaogang Zhang, John Maddock // Use, modification and distribution are subject to the // Boost Software License, Version 1.0. (See // http://www.boost.org/LICENSE_1_0.txt) // // Cephes Math Library Release 2.8: June, 2000 // Copyright 1984, 1987, 1989, 1992, 2000 by Stephen L. Moshier // What you see here may be used freely, but it comes with no support or // guarantee. #pragma once typedef double special_func_scalar_type; // {{{ cephes_polevl /* * DESCRIPTION: * * Evaluates polynomial of degree N: * * 2 N * y = C + C x + C x +...+ C x * 0 1 2 N * * Coefficients are stored in reverse order: * * coef[0] = C , ..., coef[N] = C . * N 0 * * The function p1evl() assumes that coef[N] = 1.0 and is * omitted from the array. Its calling arguments are * otherwise the same as polevl(). 
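 *
 * (Added note.) Both routines use Horner's scheme: cephes_polevl starts
 * from ans = coef[0] (the leading coefficient) and iterates
 *
 *   ans = ans * x + coef[i],    i = 1, ..., N,
 *
 * while cephes_p1evl starts from ans = x + coef[0], exploiting the
 * implicit leading coefficient of 1.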
* */ special_func_scalar_type cephes_polevl(special_func_scalar_type x, __constant const special_func_scalar_type *coef, int N) { special_func_scalar_type ans; int i; __constant const special_func_scalar_type *p; p = coef; ans = *p++; i = N; do ans = ans * x + *p++; while( --i ); return( ans ); } // }}} // {{{ cephes_p1evl special_func_scalar_type cephes_p1evl( special_func_scalar_type x, __constant const special_func_scalar_type *coef, int N ) { special_func_scalar_type ans; __constant const special_func_scalar_type *p; int i; p = coef; ans = x + *p++; i = N-1; do ans = ans * x + *p++; while( --i ); return( ans ); } // }}} // {{{ boost_evaluate_rational special_func_scalar_type boost_evaluate_rational_backend(__constant const special_func_scalar_type* num, __constant const special_func_scalar_type* denom, special_func_scalar_type z, int count) { special_func_scalar_type s1, s2; if(z <= 1) { s1 = num[count-1]; s2 = denom[count-1]; for(int i = (int)count - 2; i >= 0; --i) { s1 *= z; s2 *= z; s1 += num[i]; s2 += denom[i]; } } else { z = 1 / z; s1 = num[0]; s2 = denom[0]; for(unsigned i = 1; i < count; ++i) { s1 *= z; s2 *= z; s1 += num[i]; s2 += denom[i]; } } return s1 / s2; } #define boost_evaluate_rational(num, denom, z) \ boost_evaluate_rational_backend(num, denom, z, sizeof(num)/sizeof(special_func_scalar_type)) // }}} // vim: fdm=marker pyopencl-2025.1/pyopencl/cl/pyopencl-hankel-complex.cl0000644000000000000000000007551114332717401017702 0ustar00/* Evaluate Hankel function of first kind of order 0 and 1 for argument z anywhere in the complex plane. Copyright (C) Vladimir Rokhlin Copyright (C) 2010-2012 Leslie Greengard and Zydrunas Gimbutas Copyright (C) 2015 Andreas Kloeckner Auto-translated from https://github.com/zgimbutas/fmmlib2d/blob/master/src/hank103.f using https://github.com/inducer/pyopencl/tree/master/contrib/fortran-to-opencl Originally licensed under GPL, permission to license under MIT granted via email by Vladimir Rokhlin on May 25, 2015 and by Zydrunas Gimbutas on May 17, 2015. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
*/ void hank103(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon); void hank103u(cdouble_t z, int *ier, cdouble_t *h0, cdouble_t *h1, int ifexpon); void hank103p(__constant cdouble_t *p, int m, cdouble_t z, cdouble_t *f); void hank103a(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon); void hank103l(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon); void hank103r(cdouble_t z, int *ier, cdouble_t *h0, cdouble_t *h1, int ifexpon); /* * this subroutine evaluates the hankel functions H_0^1, H_1^1 * for an arbitrary user-specified complex number z. The user * also has the option of evaluating the functions h0, h1 * scaled by the (complex) coefficient e^{-i \cdot z}. This * subroutine is a modification of the subroutine hank102 * (see), different from the latter by having the parameter * ifexpon. Please note that the subroutine hank102 is in * turn a slightly accelerated version of the old hank101 * (see). The principal claim to fame of all three is that * they are valid on the whole complex plane, and are * reasonably accurate (14-digit relative accuracy) and * reasonably fast. Also, please note that all three have not * been carefully tested in the third quadrant (both x and y * negative); some sort of numerical trouble is possible * (though has not been observed) for LARGE z in the third * quadrant. * * ifexpon = 1 will cause the subroutine to evaluate the Hankel functions * honestly * ifexpon = 0 will cause the subroutine to scale the Hankel functions * by e^{-i \cdot z}. */ void hankel_01_complex(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon) { cdouble_t cclog; cdouble_t cd; cdouble_t fj0; cdouble_t fj1; cdouble_t h0r; cdouble_t h0u; cdouble_t h1r; cdouble_t h1u; double half_; int ier; cdouble_t ima = cdouble_new(0.0e0, 1.0e0); double pi = 0.31415926535897932e+01; cdouble_t ser2; cdouble_t ser3; double subt; cdouble_t y0; cdouble_t y1; cdouble_t z2; cdouble_t zr; cdouble_t zu; if (cdouble_imag(z) < 0) goto label_1400; hank103u(z, & ier, & (* h0), & (* h1), ifexpon); return; label_1400: ; if (cdouble_real(z) < 0) goto label_2000; hank103r(z, & ier, & (* h0), & (* h1), ifexpon); return; label_2000: ; zu = cdouble_conj(z); zr = cdouble_rmul(- 1, zu); hank103u(zu, & ier, & h0u, & h1u, ifexpon); hank103r(zr, & ier, & h0r, & h1r, ifexpon); if (ifexpon == 1) goto label_3000; subt = fabs(cdouble_imag(zu)); cd = cdouble_exp(cdouble_add(cdouble_fromreal((- 1) * subt), cdouble_mul(ima, zu))); h0u = cdouble_mul(h0u, cd); h1u = cdouble_mul(h1u, cd); cd = cdouble_exp(cdouble_add(cdouble_fromreal((- 1) * subt), cdouble_mul(ima, zr))); h0r = cdouble_mul(h0r, cd); h1r = cdouble_mul(h1r, cd); label_3000: ; half_ = 1; half_ = half_ / 2; y0 = cdouble_divide(cdouble_rmul(half_, cdouble_add(h0u, h0r)), ima); fj0 = cdouble_rmul((- 1) * half_, cdouble_add(h0u, cdouble_rmul(- 1, h0r))); y1 = cdouble_divide(cdouble_rmul((- 1) * half_, cdouble_add(h1u, cdouble_rmul(- 1, h1r))), ima); fj1 = cdouble_rmul(half_, cdouble_add(h1u, h1r)); z2 = cdouble_rmul(- 1, cdouble_conj(z)); cclog = cdouble_log(z2); ser2 = cdouble_add(y0, cdouble_rmul(- 1, cdouble_mul(cdouble_divider(cdouble_rmul(2, fj0), pi), cclog))); ser3 = cdouble_add(y1, cdouble_rmul(- 1, cdouble_mul(cdouble_divider(cdouble_rmul(2, fj1), pi), cclog))); fj0 = cdouble_conj(fj0); fj1 = cdouble_rmul(- 1, cdouble_conj(fj1)); ser2 = cdouble_conj(ser2); ser3 = cdouble_rmul(- 1, cdouble_conj(ser3)); cclog = cdouble_log(z); y0 = cdouble_add(ser2, cdouble_mul(cdouble_divider(cdouble_rmul(2, fj0), pi), cclog)); y1 = cdouble_add(ser3, 
cdouble_mul(cdouble_divider(cdouble_rmul(2, fj1), pi), cclog)); * h0 = cdouble_add(fj0, cdouble_mul(ima, y0)); * h1 = cdouble_add(fj1, cdouble_mul(ima, y1)); if (ifexpon == 1) return; cd = cdouble_exp(cdouble_add(cdouble_fromreal(subt), cdouble_rmul(- 1, cdouble_mul(ima, z)))); * h0 = cdouble_mul(* h0, cd); * h1 = cdouble_mul(* h1, cd); } __constant double hank103u_c0p1[] = {(- 1) * 0.6619836118357782e-12, (- 1) * 0.6619836118612709e-12, (- 1) * 0.7307514264754200e-21, 0.3928160926261892e-10, 0.5712712520172854e-09, (- 1) * 0.5712712519967086e-09, (- 1) * 0.1083820384008718e-07, (- 1) * 0.1894529309455499e-18, 0.7528123700585197e-07, 0.7528123700841491e-07, 0.1356544045548053e-16, (- 1) * 0.8147940452202855e-06, (- 1) * 0.3568198575016769e-05, 0.3568198574899888e-05, 0.2592083111345422e-04, 0.4209074870019400e-15, (- 1) * 0.7935843289157352e-04, (- 1) * 0.7935843289415642e-04, (- 1) * 0.6848330800445365e-14, 0.4136028298630129e-03, 0.9210433149997867e-03, (- 1) * 0.9210433149680665e-03, (- 1) * 0.3495306809056563e-02, (- 1) * 0.6469844672213905e-13, 0.5573890502766937e-02, 0.5573890503000873e-02, 0.3767341857978150e-12, (- 1) * 0.1439178509436339e-01, (- 1) * 0.1342403524448708e-01, 0.1342403524340215e-01, 0.8733016209933828e-02, 0.1400653553627576e-11, 0.2987361261932706e-01, 0.2987361261607835e-01, (- 1) * 0.3388096836339433e-11, (- 1) * 0.1690673895793793e+00, 0.2838366762606121e+00, (- 1) * 0.2838366762542546e+00, 0.7045107746587499e+00, (- 1) * 0.5363893133864181e-11, (- 1) * 0.7788044738211666e+00, (- 1) * 0.7788044738130360e+00, 0.5524779104964783e-11, 0.1146003459721775e+01, 0.6930697486173089e+00, (- 1) * 0.6930697486240221e+00, (- 1) * 0.7218270272305891e+00, 0.3633022466839301e-11, 0.3280924142354455e+00, 0.3280924142319602e+00, (- 1) * 0.1472323059106612e-11, (- 1) * 0.2608421334424268e+00, (- 1) * 0.9031397649230536e-01, 0.9031397649339185e-01, 0.5401342784296321e-01, (- 1) * 0.3464095071668884e-12, (- 1) * 0.1377057052946721e-01, (- 1) * 0.1377057052927901e-01, 0.4273263742980154e-13, 0.5877224130705015e-02, 0.1022508471962664e-02, (- 1) * 0.1022508471978459e-02, (- 1) * 0.2789107903871137e-03, 0.2283984571396129e-14, 0.2799719727019427e-04, 0.2799719726970900e-04, (- 1) * 0.3371218242141487e-16, (- 1) * 0.3682310515545645e-05, (- 1) * 0.1191412910090512e-06, 0.1191412910113518e-06}; __constant double hank103u_c0p2[] = {0.5641895835516786e+00, (- 1) * 0.5641895835516010e+00, (- 1) * 0.3902447089770041e-09, (- 1) * 0.3334441074447365e-11, (- 1) * 0.7052368835911731e-01, (- 1) * 0.7052368821797083e-01, 0.1957299315085370e-08, (- 1) * 0.3126801711815631e-06, (- 1) * 0.3967331737107949e-01, 0.3967327747706934e-01, 0.6902866639752817e-04, 0.3178420816292497e-06, 0.4080457166061280e-01, 0.4080045784614144e-01, (- 1) * 0.2218731025620065e-04, 0.6518438331871517e-02, 0.9798339748600499e-01, (- 1) * 0.9778028374972253e-01, (- 1) * 0.3151825524811773e+00, (- 1) * 0.7995603166188139e-03, 0.1111323666639636e+01, 0.1116791178994330e+01, 0.1635711249533488e-01, (- 1) * 0.8527067497983841e+01, (- 1) * 0.2595553689471247e+02, 0.2586942834408207e+02, 0.1345583522428299e+03, 0.2002017907999571e+00, (- 1) * 0.3086364384881525e+03, (- 1) * 0.3094609382885628e+03, (- 1) * 0.1505974589617013e+01, 0.1250150715797207e+04, 0.2205210257679573e+04, (- 1) * 0.2200328091885836e+04, (- 1) * 0.6724941072552172e+04, (- 1) * 0.7018887749450317e+01, 0.8873498980910335e+04, 0.8891369384353965e+04, 0.2008805099643591e+02, (- 1) * 0.2030681426035686e+05, (- 1) * 0.2010017782384992e+05, 0.2006046282661137e+05, 
0.3427941581102808e+05, 0.3432892927181724e+02, (- 1) * 0.2511417407338804e+05, (- 1) * 0.2516567363193558e+05, (- 1) * 0.3318253740485142e+02, 0.3143940826027085e+05, 0.1658466564673543e+05, (- 1) * 0.1654843151976437e+05, (- 1) * 0.1446345041326510e+05, (- 1) * 0.1645433213663233e+02, 0.5094709396573681e+04, 0.5106816671258367e+04, 0.3470692471612145e+01, (- 1) * 0.2797902324245621e+04, (- 1) * 0.5615581955514127e+03, 0.5601021281020627e+03, 0.1463856702925587e+03, 0.1990076422327786e+00, (- 1) * 0.9334741618922085e+01, (- 1) * 0.9361368967669095e+01}; __constant double hank103u_c1p1[] = {0.4428361927253983e-12, (- 1) * 0.4428361927153559e-12, (- 1) * 0.2575693161635231e-10, (- 1) * 0.2878656317479645e-21, 0.3658696304107867e-09, 0.3658696304188925e-09, 0.7463138750413651e-19, (- 1) * 0.6748894854135266e-08, (- 1) * 0.4530098210372099e-07, 0.4530098210271137e-07, 0.4698787882823243e-06, 0.5343848349451927e-17, (- 1) * 0.1948662942158171e-05, (- 1) * 0.1948662942204214e-05, (- 1) * 0.1658085463182409e-15, 0.1316906100496570e-04, 0.3645368564036497e-04, (- 1) * 0.3645368563934748e-04, (- 1) * 0.1633458547818390e-03, (- 1) * 0.2697770638600506e-14, 0.2816784976551660e-03, 0.2816784976676616e-03, 0.2548673351180060e-13, (- 1) * 0.6106478245116582e-03, 0.2054057459296899e-03, (- 1) * 0.2054057460218446e-03, (- 1) * 0.6254962367291260e-02, 0.1484073406594994e-12, 0.1952900562500057e-01, 0.1952900562457318e-01, (- 1) * 0.5517611343746895e-12, (- 1) * 0.8528074392467523e-01, (- 1) * 0.1495138141086974e+00, 0.1495138141099772e+00, 0.4394907314508377e+00, (- 1) * 0.1334677126491326e-11, (- 1) * 0.1113740586940341e+01, (- 1) * 0.1113740586937837e+01, 0.2113005088866033e-11, 0.1170212831401968e+01, 0.1262152242318805e+01, (- 1) * 0.1262152242322008e+01, (- 1) * 0.1557810619605511e+01, 0.2176383208521897e-11, 0.8560741701626648e+00, 0.8560741701600203e+00, (- 1) * 0.1431161194996653e-11, (- 1) * 0.8386735092525187e+00, (- 1) * 0.3651819176599290e+00, 0.3651819176613019e+00, 0.2811692367666517e+00, (- 1) * 0.5799941348040361e-12, (- 1) * 0.9494630182937280e-01, (- 1) * 0.9494630182894480e-01, 0.1364615527772751e-12, 0.5564896498129176e-01, 0.1395239688792536e-01, (- 1) * 0.1395239688799950e-01, (- 1) * 0.5871314703753967e-02, 0.1683372473682212e-13, 0.1009157100083457e-02, 0.1009157100077235e-02, (- 1) * 0.8997331160162008e-15, (- 1) * 0.2723724213360371e-03, (- 1) * 0.2708696587599713e-04, 0.2708696587618830e-04, 0.3533092798326666e-05, (- 1) * 0.1328028586935163e-16, (- 1) * 0.1134616446885126e-06, (- 1) * 0.1134616446876064e-06}; __constant double hank103u_c1p2[] = {(- 1) * 0.5641895835446003e+00, (- 1) * 0.5641895835437973e+00, 0.3473016376419171e-10, (- 1) * 0.3710264617214559e-09, 0.2115710836381847e+00, (- 1) * 0.2115710851180242e+00, 0.3132928887334847e-06, 0.2064187785625558e-07, (- 1) * 0.6611954881267806e-01, (- 1) * 0.6611997176900310e-01, (- 1) * 0.3386004893181560e-05, 0.7146557892862998e-04, (- 1) * 0.5728505088320786e-01, 0.5732906930408979e-01, (- 1) * 0.6884187195973806e-02, (- 1) * 0.2383737409286457e-03, 0.1170452203794729e+00, 0.1192356405185651e+00, 0.8652871239920498e-02, (- 1) * 0.3366165876561572e+00, (- 1) * 0.1203989383538728e+01, 0.1144625888281483e+01, 0.9153684260534125e+01, 0.1781426600949249e+00, (- 1) * 0.2740411284066946e+02, (- 1) * 0.2834461441294877e+02, (- 1) * 0.2192611071606340e+01, 0.1445470231392735e+03, 0.3361116314072906e+03, (- 1) * 0.3270584743216529e+03, (- 1) * 0.1339254798224146e+04, (- 1) * 0.1657618537130453e+02, 0.2327097844591252e+04, 
0.2380960024514808e+04, 0.7760611776965994e+02, (- 1) * 0.7162513471480693e+04, (- 1) * 0.9520608696419367e+04, 0.9322604506839242e+04, 0.2144033447577134e+05, 0.2230232555182369e+03, (- 1) * 0.2087584364240919e+05, (- 1) * 0.2131762020653283e+05, (- 1) * 0.3825699231499171e+03, 0.3582976792594737e+05, 0.2642632405857713e+05, (- 1) * 0.2585137938787267e+05, (- 1) * 0.3251446505037506e+05, (- 1) * 0.3710875194432116e+03, 0.1683805377643986e+05, 0.1724393921722052e+05, 0.1846128226280221e+03, (- 1) * 0.1479735877145448e+05, (- 1) * 0.5258288893282565e+04, 0.5122237462705988e+04, 0.2831540486197358e+04, 0.3905972651440027e+02, (- 1) * 0.5562781548969544e+03, (- 1) * 0.5726891190727206e+03, (- 1) * 0.2246192560136119e+01, 0.1465347141877978e+03, 0.9456733342595993e+01, (- 1) * 0.9155767836700837e+01}; void hank103u(cdouble_t z, int *ier, cdouble_t *h0, cdouble_t *h1, int ifexpon) { cdouble_t ccex; cdouble_t cd; double com; double d; double done; cdouble_t ima = cdouble_new(0.0e0, 1.0e0); int m; double thresh1; double thresh2; double thresh3; cdouble_t zzz9; * ier = 0; com = cdouble_real(z); if (cdouble_imag(z) >= 0) goto label_1200; * ier = 4; return; label_1200: ; done = 1; thresh1 = 1; thresh2 = 3.7 * 3.7; thresh3 = 400; d = cdouble_real(cdouble_mul(z, cdouble_conj(z))); if ((d < thresh1) || (d > thresh3)) goto label_3000; if (d > thresh2) goto label_2000; cd = cdouble_rdivide(done, cdouble_sqrt(z)); ccex = cd; if (ifexpon == 1) ccex = cdouble_mul(ccex, cdouble_exp(cdouble_mul(ima, z))); zzz9 = cdouble_powr(z, 9); m = 35; hank103p((__constant cdouble_t *) (& (* hank103u_c0p1)), m, cd, & (* h0)); * h0 = cdouble_mul(cdouble_mul(* h0, ccex), zzz9); hank103p((__constant cdouble_t *) (& (* hank103u_c1p1)), m, cd, & (* h1)); * h1 = cdouble_mul(cdouble_mul(* h1, ccex), zzz9); return; label_2000: ; cd = cdouble_rdivide(done, cdouble_sqrt(z)); ccex = cd; if (ifexpon == 1) ccex = cdouble_mul(ccex, cdouble_exp(cdouble_mul(ima, z))); m = 31; hank103p((__constant cdouble_t *) (& (* hank103u_c0p2)), m, cd, & (* h0)); * h0 = cdouble_mul(* h0, ccex); m = 31; hank103p((__constant cdouble_t *) (& (* hank103u_c1p2)), m, cd, & (* h1)); * h1 = cdouble_mul(* h1, ccex); return; label_3000: ; if (d > 50.e0) goto label_4000; hank103l(z, & (* h0), & (* h1), ifexpon); return; label_4000: ; hank103a(z, & (* h0), & (* h1), ifexpon); } void hank103p(__constant cdouble_t *p, int m, cdouble_t z, cdouble_t *f) { int i; * f = p[m - 1]; for (i = m + (- 1); i >= 1; i += - 1) { * f = cdouble_add(cdouble_mul(* f, z), p[i - 1]); label_1200: ; } } __constant double hank103a_p[] = {0.1000000000000000e+01, (- 1) * 0.7031250000000000e-01, 0.1121520996093750e+00, (- 1) * 0.5725014209747314e+00, 0.6074042001273483e+01, (- 1) * 0.1100171402692467e+03, 0.3038090510922384e+04, (- 1) * 0.1188384262567833e+06, 0.6252951493434797e+07, (- 1) * 0.4259392165047669e+09, 0.3646840080706556e+11, (- 1) * 0.3833534661393944e+13, 0.4854014686852901e+15, (- 1) * 0.7286857349377657e+17, 0.1279721941975975e+20, (- 1) * 0.2599382102726235e+22, 0.6046711487532401e+24, (- 1) * 0.1597065525294211e+27}; __constant double hank103a_p1[] = {0.1000000000000000e+01, 0.1171875000000000e+00, (- 1) * 0.1441955566406250e+00, 0.6765925884246826e+00, (- 1) * 0.6883914268109947e+01, 0.1215978918765359e+03, (- 1) * 0.3302272294480852e+04, 0.1276412726461746e+06, (- 1) * 0.6656367718817687e+07, 0.4502786003050393e+09, (- 1) * 0.3833857520742789e+11, 0.4011838599133198e+13, (- 1) * 0.5060568503314726e+15, 0.7572616461117957e+17, (- 1) * 0.1326257285320556e+20, 
0.2687496750276277e+22, (- 1) * 0.6238670582374700e+24, 0.1644739123064188e+27}; __constant double hank103a_q[] = {(- 1) * 0.1250000000000000e+00, 0.7324218750000000e-01, (- 1) * 0.2271080017089844e+00, 0.1727727502584457e+01, (- 1) * 0.2438052969955606e+02, 0.5513358961220206e+03, (- 1) * 0.1825775547429317e+05, 0.8328593040162893e+06, (- 1) * 0.5006958953198893e+08, 0.3836255180230434e+10, (- 1) * 0.3649010818849834e+12, 0.4218971570284096e+14, (- 1) * 0.5827244631566907e+16, 0.9476288099260110e+18, (- 1) * 0.1792162323051699e+21, 0.3900121292034000e+23, (- 1) * 0.9677028801069847e+25, 0.2715581773544907e+28}; __constant double hank103a_q1[] = {0.3750000000000000e+00, (- 1) * 0.1025390625000000e+00, 0.2775764465332031e+00, (- 1) * 0.1993531733751297e+01, 0.2724882731126854e+02, (- 1) * 0.6038440767050702e+03, 0.1971837591223663e+05, (- 1) * 0.8902978767070679e+06, 0.5310411010968522e+08, (- 1) * 0.4043620325107754e+10, 0.3827011346598606e+12, (- 1) * 0.4406481417852279e+14, 0.6065091351222699e+16, (- 1) * 0.9833883876590680e+18, 0.1855045211579829e+21, (- 1) * 0.4027994121281017e+23, 0.9974783533410457e+25, (- 1) * 0.2794294288720121e+28}; void hank103a(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon) { cdouble_t cccexp; cdouble_t cdd; cdouble_t cdumb = cdouble_new(0.70710678118654757e+00, (- 1) * 0.70710678118654746e+00); double done = 1.0e0; int i; cdouble_t ima = cdouble_new(0.0e0, 1.0e0); int m; double pi = 0.31415926535897932e+01; cdouble_t pp; cdouble_t pp1; cdouble_t qq; cdouble_t qq1; cdouble_t zinv; cdouble_t zinv22; m = 10; zinv = cdouble_rdivide(done, z); pp = cdouble_fromreal(hank103a_p[m - 1]); pp1 = cdouble_fromreal(hank103a_p1[m - 1]); zinv22 = cdouble_mul(zinv, zinv); qq = cdouble_fromreal(hank103a_q[m - 1]); qq1 = cdouble_fromreal(hank103a_q1[m - 1]); for (i = m + (- 1); i >= 1; i += - 1) { pp = cdouble_add(cdouble_fromreal(hank103a_p[i - 1]), cdouble_mul(pp, zinv22)); pp1 = cdouble_add(cdouble_fromreal(hank103a_p1[i - 1]), cdouble_mul(pp1, zinv22)); qq = cdouble_add(cdouble_fromreal(hank103a_q[i - 1]), cdouble_mul(qq, zinv22)); qq1 = cdouble_add(cdouble_fromreal(hank103a_q1[i - 1]), cdouble_mul(qq1, zinv22)); label_1600: ; } qq = cdouble_mul(qq, zinv); qq1 = cdouble_mul(qq1, zinv); cccexp = cdouble_fromreal(1); if (ifexpon == 1) cccexp = cdouble_exp(cdouble_mul(ima, z)); cdd = cdouble_sqrt(cdouble_rmul(2 / pi, zinv)); * h0 = cdouble_add(pp, cdouble_mul(ima, qq)); * h0 = cdouble_mul(cdouble_mul(cdouble_mul(cdd, cdumb), cccexp), * h0); * h1 = cdouble_add(pp1, cdouble_mul(ima, qq1)); * h1 = cdouble_rmul(- 1, cdouble_mul(cdouble_mul(cdouble_mul(cdouble_mul(cdd, cccexp), cdumb), * h1), ima)); } __constant double hank103l_cj0[] = {0.1000000000000000e+01, (- 1) * 0.2500000000000000e+00, 0.1562500000000000e-01, (- 1) * 0.4340277777777778e-03, 0.6781684027777778e-05, (- 1) * 0.6781684027777778e-07, 0.4709502797067901e-09, (- 1) * 0.2402807549524439e-11, 0.9385966990329841e-14, (- 1) * 0.2896903392077112e-16, 0.7242258480192779e-19, (- 1) * 0.1496334396734045e-21, 0.2597802772107717e-24, (- 1) * 0.3842903509035085e-27, 0.4901662639075363e-30, (- 1) * 0.5446291821194848e-33}; __constant double hank103l_cj1[] = {(- 1) * 0.5000000000000000e+00, 0.6250000000000000e-01, (- 1) * 0.2604166666666667e-02, 0.5425347222222222e-04, (- 1) * 0.6781684027777778e-06, 0.5651403356481481e-08, (- 1) * 0.3363930569334215e-10, 0.1501754718452775e-12, (- 1) * 0.5214426105738801e-15, 0.1448451696038556e-17, (- 1) * 0.3291935672814899e-20, 0.6234726653058522e-23, (- 1) * 0.9991549123491221e-26, 
0.1372465538941102e-28, (- 1) * 0.1633887546358454e-31, 0.1701966194123390e-34}; __constant double hank103l_ser2[] = {0.2500000000000000e+00, (- 1) * 0.2343750000000000e-01, 0.7957175925925926e-03, (- 1) * 0.1412850839120370e-04, 0.1548484519675926e-06, (- 1) * 0.1153828185281636e-08, 0.6230136717695511e-11, (- 1) * 0.2550971742728932e-13, 0.8195247730999099e-16, (- 1) * 0.2121234517551702e-18, 0.4518746345057852e-21, (- 1) * 0.8061529302289970e-24, 0.1222094716680443e-26, (- 1) * 0.1593806157473552e-29, 0.1807204342667468e-32, (- 1) * 0.1798089518115172e-35}; __constant double hank103l_ser2der[] = {0.5000000000000000e+00, (- 1) * 0.9375000000000000e-01, 0.4774305555555556e-02, (- 1) * 0.1130280671296296e-03, 0.1548484519675926e-05, (- 1) * 0.1384593822337963e-07, 0.8722191404773715e-10, (- 1) * 0.4081554788366291e-12, 0.1475144591579838e-14, (- 1) * 0.4242469035103405e-17, 0.9941241959127275e-20, (- 1) * 0.1934767032549593e-22, 0.3177446263369152e-25, (- 1) * 0.4462657240925946e-28, 0.5421613028002404e-31, (- 1) * 0.5753886457968550e-34}; void hank103l(cdouble_t z, cdouble_t *h0, cdouble_t *h1, int ifexpon) { cdouble_t cd; cdouble_t cdddlog; cdouble_t fj0; cdouble_t fj1; double gamma = 0.5772156649015328606e+00; int i; cdouble_t ima = cdouble_new(0.0e0, 1.0e0); int m; double pi = 0.31415926535897932e+01; double two = 2.0e0; cdouble_t y0; cdouble_t y1; cdouble_t z2; m = 16; fj0 = cdouble_fromreal(0); fj1 = cdouble_fromreal(0); y0 = cdouble_fromreal(0); y1 = cdouble_fromreal(0); z2 = cdouble_mul(z, z); cd = cdouble_fromreal(1); for (i = 1; i <= m; i += 1) { fj0 = cdouble_add(fj0, cdouble_rmul(hank103l_cj0[i - 1], cd)); fj1 = cdouble_add(fj1, cdouble_rmul(hank103l_cj1[i - 1], cd)); y1 = cdouble_add(y1, cdouble_rmul(hank103l_ser2der[i - 1], cd)); cd = cdouble_mul(cd, z2); y0 = cdouble_add(y0, cdouble_rmul(hank103l_ser2[i - 1], cd)); label_1800: ; } fj1 = cdouble_rmul(- 1, cdouble_mul(fj1, z)); cdddlog = cdouble_add(cdouble_fromreal(gamma), cdouble_log(cdouble_divider(z, two))); y0 = cdouble_add(cdouble_mul(cdddlog, fj0), y0); y0 = cdouble_rmul(two / pi, y0); y1 = cdouble_mul(y1, z); y1 = cdouble_add(cdouble_add(cdouble_rmul(- 1, cdouble_mul(cdddlog, fj1)), cdouble_divide(fj0, z)), y1); y1 = cdouble_divider(cdouble_rmul((- 1) * two, y1), pi); * h0 = cdouble_add(fj0, cdouble_mul(ima, y0)); * h1 = cdouble_add(fj1, cdouble_mul(ima, y1)); if (ifexpon == 1) return; cd = cdouble_exp(cdouble_rmul(- 1, cdouble_mul(ima, z))); * h0 = cdouble_mul(* h0, cd); * h1 = cdouble_mul(* h1, cd); } __constant double hank103r_c0p1[] = {(- 1) * 0.4268441995428495e-23, 0.4374027848105921e-23, 0.9876152216238049e-23, (- 1) * 0.1065264808278614e-20, 0.6240598085551175e-19, 0.6658529985490110e-19, (- 1) * 0.5107210870050163e-17, (- 1) * 0.2931746613593983e-18, 0.1611018217758854e-15, (- 1) * 0.1359809022054077e-15, (- 1) * 0.7718746693707326e-15, 0.6759496139812828e-14, (- 1) * 0.1067620915195442e-12, (- 1) * 0.1434699000145826e-12, 0.3868453040754264e-11, 0.7061853392585180e-12, (- 1) * 0.6220133527871203e-10, 0.3957226744337817e-10, 0.3080863675628417e-09, (- 1) * 0.1154618431281900e-08, 0.7793319486868695e-08, 0.1502570745460228e-07, (- 1) * 0.1978090852638430e-06, (- 1) * 0.7396691873499030e-07, 0.2175857247417038e-05, (- 1) * 0.8473534855334919e-06, (- 1) * 0.1053381327609720e-04, 0.2042555121261223e-04, (- 1) * 0.4812568848956982e-04, (- 1) * 0.1961519090873697e-03, 0.1291714391689374e-02, 0.9234422384950050e-03, (- 1) * 0.1113890671502769e-01, 0.9053687375483149e-03, 0.5030666896877862e-01, (- 1) * 
0.4923119348218356e-01, 0.5202355973926321e+00, (- 1) * 0.1705244841954454e+00, (- 1) * 0.1134990486611273e+01, (- 1) * 0.1747542851820576e+01, 0.8308174484970718e+01, 0.2952358687641577e+01, (- 1) * 0.3286074510100263e+02, 0.1126542966971545e+02, 0.6576015458463394e+02, (- 1) * 0.1006116996293757e+03, 0.3216834899377392e+02, 0.3614005342307463e+03, (- 1) * 0.6653878500833375e+03, (- 1) * 0.6883582242804924e+03, 0.2193362007156572e+04, 0.2423724600546293e+03, (- 1) * 0.3665925878308203e+04, 0.2474933189642588e+04, 0.1987663383445796e+04, (- 1) * 0.7382586600895061e+04, 0.4991253411017503e+04, 0.1008505017740918e+05, (- 1) * 0.1285284928905621e+05, (- 1) * 0.5153674821668470e+04, 0.1301656757246985e+05, (- 1) * 0.4821250366504323e+04, (- 1) * 0.4982112643422311e+04, 0.9694070195648748e+04, (- 1) * 0.1685723189234701e+04, (- 1) * 0.6065143678129265e+04, 0.2029510635584355e+04, 0.1244402339119502e+04, (- 1) * 0.4336682903961364e+03, 0.8923209875101459e+02}; __constant double hank103r_c0p2[] = {0.5641895835569398e+00, (- 1) * 0.5641895835321127e+00, (- 1) * 0.7052370223565544e-01, (- 1) * 0.7052369923405479e-01, (- 1) * 0.3966909368581382e-01, 0.3966934297088857e-01, 0.4130698137268744e-01, 0.4136196771522681e-01, 0.6240742346896508e-01, (- 1) * 0.6553556513852438e-01, (- 1) * 0.3258849904760676e-01, (- 1) * 0.7998036854222177e-01, (- 1) * 0.3988006311955270e+01, 0.1327373751674479e+01, 0.6121789346915312e+02, (- 1) * 0.9251865216627577e+02, 0.4247064992018806e+03, 0.2692553333489150e+04, (- 1) * 0.4374691601489926e+05, (- 1) * 0.3625248208112831e+05, 0.1010975818048476e+07, (- 1) * 0.2859360062580096e+05, (- 1) * 0.1138970241206912e+08, 0.1051097979526042e+08, 0.2284038899211195e+08, (- 1) * 0.2038012515235694e+09, 0.1325194353842857e+10, 0.1937443530361381e+10, (- 1) * 0.2245999018652171e+11, (- 1) * 0.5998903865344352e+10, 0.1793237054876609e+12, (- 1) * 0.8625159882306147e+11, (- 1) * 0.5887763042735203e+12, 0.1345331284205280e+13, (- 1) * 0.2743432269370813e+13, (- 1) * 0.8894942160272255e+13, 0.4276463113794564e+14, 0.2665019886647781e+14, (- 1) * 0.2280727423955498e+15, 0.3686908790553973e+14, 0.5639846318168615e+15, (- 1) * 0.6841529051615703e+15, 0.9901426799966038e+14, 0.2798406605978152e+16, (- 1) * 0.4910062244008171e+16, (- 1) * 0.5126937967581805e+16, 0.1387292951936756e+17, 0.1043295727224325e+16, (- 1) * 0.1565204120687265e+17, 0.1215262806973577e+17, 0.3133802397107054e+16, (- 1) * 0.1801394550807078e+17, 0.4427598668012807e+16, 0.6923499968336864e+16}; __constant double hank103r_c1p1[] = {(- 1) * 0.4019450270734195e-23, (- 1) * 0.4819240943285824e-23, 0.1087220822839791e-20, 0.1219058342725899e-21, (- 1) * 0.7458149572694168e-19, 0.5677825613414602e-19, 0.8351815799518541e-18, (- 1) * 0.5188585543982425e-17, 0.1221075065755962e-15, 0.1789261470637227e-15, (- 1) * 0.6829972121890858e-14, (- 1) * 0.1497462301804588e-14, 0.1579028042950957e-12, (- 1) * 0.9414960303758800e-13, (- 1) * 0.1127570848999746e-11, 0.3883137940932639e-11, (- 1) * 0.3397569083776586e-10, (- 1) * 0.6779059427459179e-10, 0.1149529442506273e-08, 0.4363087909873751e-09, (- 1) * 0.1620182360840298e-07, 0.6404695607668289e-08, 0.9651461037419628e-07, (- 1) * 0.1948572160668177e-06, 0.6397881896749446e-06, 0.2318661930507743e-05, (- 1) * 0.1983192412396578e-04, (- 1) * 0.1294811208715315e-04, 0.2062663873080766e-03, (- 1) * 0.2867633324735777e-04, (- 1) * 0.1084309075952914e-02, 0.1227880935969686e-02, 0.2538406015667726e-03, (- 1) * 0.1153316815955356e-01, 0.4520140008266983e-01, 0.5693944718258218e-01, (- 1) 
* 0.9640790976658534e+00, (- 1) * 0.6517135574036008e+00, 0.2051491829570049e+01, (- 1) * 0.1124151010077572e+01, (- 1) * 0.3977380460328048e+01, 0.8200665483661009e+01, (- 1) * 0.7950131652215817e+01, (- 1) * 0.3503037697046647e+02, 0.9607320812492044e+02, 0.7894079689858070e+02, (- 1) * 0.3749002890488298e+03, (- 1) * 0.8153831134140778e+01, 0.7824282518763973e+03, (- 1) * 0.6035276543352174e+03, (- 1) * 0.5004685759675768e+03, 0.2219009060854551e+04, (- 1) * 0.2111301101664672e+04, (- 1) * 0.4035632271617418e+04, 0.7319737262526823e+04, 0.2878734389521922e+04, (- 1) * 0.1087404934318719e+05, 0.3945740567322783e+04, 0.6727823761148537e+04, (- 1) * 0.1253555346597302e+05, 0.3440468371829973e+04, 0.1383240926370073e+05, (- 1) * 0.9324927373036743e+04, (- 1) * 0.6181580304530313e+04, 0.6376198146666679e+04, (- 1) * 0.1033615527971958e+04, (- 1) * 0.1497604891055181e+04, 0.1929025541588262e+04, (- 1) * 0.4219760183545219e+02, (- 1) * 0.4521162915353207e+03}; __constant double hank103r_c1p2[] = {(- 1) * 0.5641895835431980e+00, (- 1) * 0.5641895835508094e+00, 0.2115710934750869e+00, (- 1) * 0.2115710923186134e+00, (- 1) * 0.6611607335011594e-01, (- 1) * 0.6611615414079688e-01, (- 1) * 0.5783289433408652e-01, 0.5785737744023628e-01, 0.8018419623822896e-01, 0.8189816020440689e-01, 0.1821045296781145e+00, (- 1) * 0.2179738973008740e+00, 0.5544705668143094e+00, 0.2224466316444440e+01, (- 1) * 0.8563271248520645e+02, (- 1) * 0.4394325758429441e+02, 0.2720627547071340e+04, (- 1) * 0.6705390850875292e+03, (- 1) * 0.3936221960600770e+05, 0.5791730432605451e+05, (- 1) * 0.1976787738827811e+06, (- 1) * 0.1502498631245144e+07, 0.2155317823990686e+08, 0.1870953796705298e+08, (- 1) * 0.4703995711098311e+09, 0.3716595906453190e+07, 0.5080557859012385e+10, (- 1) * 0.4534199223888966e+10, (- 1) * 0.1064438211647413e+11, 0.8612243893745942e+11, (- 1) * 0.5466017687785078e+12, (- 1) * 0.8070950386640701e+12, 0.9337074941225827e+13, 0.2458379240643264e+13, (- 1) * 0.7548692171244579e+14, 0.3751093169954336e+14, 0.2460677431350039e+15, (- 1) * 0.5991919372881911e+15, 0.1425679408434606e+16, 0.4132221939781502e+16, (- 1) * 0.2247506469468969e+17, (- 1) * 0.1269771078165026e+17, 0.1297336292749026e+18, (- 1) * 0.2802626909791308e+17, (- 1) * 0.3467137222813017e+18, 0.4773955215582192e+18, (- 1) * 0.2347165776580206e+18, (- 1) * 0.2233638097535785e+19, 0.5382350866778548e+19, 0.4820328886922998e+19, (- 1) * 0.1928978948099345e+20, 0.1575498747750907e+18, 0.3049162180215152e+20, (- 1) * 0.2837046201123502e+20, (- 1) * 0.5429391644354291e+19, 0.6974653380104308e+20, (- 1) * 0.5322120857794536e+20, (- 1) * 0.6739879079691706e+20, 0.6780343087166473e+20, 0.1053455984204666e+20, (- 1) * 0.2218784058435737e+20, 0.1505391868530062e+20}; void hank103r(cdouble_t z, int *ier, cdouble_t *h0, cdouble_t *h1, int ifexpon) { cdouble_t cccexp; cdouble_t cd; cdouble_t cdd; double d; double done; cdouble_t ima = cdouble_new(0.0e0, 1.0e0); int m; double thresh1; double thresh2; double thresh3; cdouble_t zz18; * ier = 0; if ((cdouble_real(z) >= 0) && (cdouble_imag(z) <= 0)) goto label_1400; * ier = 4; return; label_1400: ; done = 1; thresh1 = 16; thresh2 = 64; thresh3 = 400; d = cdouble_real(cdouble_mul(z, cdouble_conj(z))); if ((d < thresh1) || (d > thresh3)) goto label_3000; if (d > thresh2) goto label_2000; cccexp = cdouble_fromreal(1); if (ifexpon == 1) cccexp = cdouble_exp(cdouble_mul(ima, z)); cdd = cdouble_rdivide(done, cdouble_sqrt(z)); cd = cdouble_rdivide(done, z); zz18 = cdouble_powr(z, 18); m = 35; hank103p((__constant 
cdouble_t *) (& (* hank103r_c0p1)), m, cd, & (* h0)); * h0 = cdouble_mul(cdouble_mul(cdouble_mul(* h0, cdd), cccexp), zz18); hank103p((__constant cdouble_t *) (& (* hank103r_c1p1)), m, cd, & (* h1)); * h1 = cdouble_mul(cdouble_mul(cdouble_mul(* h1, cdd), cccexp), zz18); return; label_2000: ; cd = cdouble_rdivide(done, z); cdd = cdouble_sqrt(cd); cccexp = cdouble_fromreal(1); if (ifexpon == 1) cccexp = cdouble_exp(cdouble_mul(ima, z)); m = 27; hank103p((__constant cdouble_t *) (& (* hank103r_c0p2)), m, cd, & (* h0)); * h0 = cdouble_mul(cdouble_mul(* h0, cccexp), cdd); m = 31; hank103p((__constant cdouble_t *) (& (* hank103r_c1p2)), m, cd, & (* h1)); * h1 = cdouble_mul(cdouble_mul(* h1, cccexp), cdd); return; label_3000: ; if (d > 50.e0) goto label_4000; hank103l(z, & (* h0), & (* h1), ifexpon); return; label_4000: ; hank103a(z, & (* h0), & (* h1), ifexpon); } pyopencl-2025.1/pyopencl/cl/pyopencl-random123/array.h0000644000000000000000000004130014332717401017435 0ustar00/* Copyright 2010-2011, D. E. Shaw Research. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of D. E. Shaw Research nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #ifndef _r123array_dot_h__ #define _r123array_dot_h__ #include "openclfeatures.h" #ifndef __cplusplus #define CXXMETHODS(_N, W, T) #define CXXOVERLOADS(_N, W, T) #else #include #include #include #include #include #include /** @defgroup arrayNxW The r123arrayNxW classes Each of the r123arrayNxW is a fixed size array of N W-bit unsigned integers. It is functionally equivalent to the C++0x std::array, but does not require C++0x features or libraries. In addition to meeting most of the requirements of a Container, it also has a member function, incr(), which increments the zero-th element and carrys overflows into higher indexed elements. Thus, by using incr(), sequences of up to 2^(N*W) distinct values can be produced. If SSE is supported by the compiler, then the class r123array1xm128i is also defined, in which the data member is an array of one r123128i object. 
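As an illustration of the carry behavior (an added note, not from the
original documentation): if a 2x32 array holds v[0]=0xffffffff, v[1]=0,
then a single call to incr() wraps v[0] to 0 and carries into v[1],
leaving v[0]=0, v[1]=1. This carry chain is what makes the full 2^64
counter space reachable for N=2, W=32.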
@cond HIDDEN_FROM_DOXYGEN */

template <typename value_type>
inline R123_CUDA_DEVICE value_type assemble_from_u32(uint32_t *p32){
    value_type v=0;
    for(size_t i=0; i<(3+sizeof(value_type))/4; ++i)
        v |= ((value_type)(*p32++)) << (32*i);
    return v;
}

// Work-alike methods and typedefs modeled on std::array:
#define CXXMETHODS(_N, W, T) \
    typedef T value_type; \
    typedef T* iterator; \
    typedef const T* const_iterator; \
    typedef value_type& reference; \
    typedef const value_type& const_reference; \
    typedef size_t size_type; \
    typedef ptrdiff_t difference_type; \
    typedef T* pointer; \
    typedef const T* const_pointer; \
    typedef std::reverse_iterator<iterator> reverse_iterator; \
    typedef std::reverse_iterator<const_iterator> const_reverse_iterator; \
    /* Boost.array has static_size.  C++11 specializes tuple_size */ \
    enum {static_size = _N}; \
    R123_CUDA_DEVICE reference operator[](size_type i){return v[i];} \
    R123_CUDA_DEVICE const_reference operator[](size_type i) const {return v[i];} \
    R123_CUDA_DEVICE reference at(size_type i){ if(i >= _N) R123_THROW(std::out_of_range("array index out of range")); return (*this)[i]; } \
    R123_CUDA_DEVICE const_reference at(size_type i) const { if(i >= _N) R123_THROW(std::out_of_range("array index out of range")); return (*this)[i]; } \
    R123_CUDA_DEVICE size_type size() const { return _N; } \
    R123_CUDA_DEVICE size_type max_size() const { return _N; } \
    R123_CUDA_DEVICE bool empty() const { return _N==0; }; \
    R123_CUDA_DEVICE iterator begin() { return &v[0]; } \
    R123_CUDA_DEVICE iterator end() { return &v[_N]; } \
    R123_CUDA_DEVICE const_iterator begin() const { return &v[0]; } \
    R123_CUDA_DEVICE const_iterator end() const { return &v[_N]; } \
    R123_CUDA_DEVICE const_iterator cbegin() const { return &v[0]; } \
    R123_CUDA_DEVICE const_iterator cend() const { return &v[_N]; } \
    R123_CUDA_DEVICE reverse_iterator rbegin(){ return reverse_iterator(end()); } \
    R123_CUDA_DEVICE const_reverse_iterator rbegin() const{ return const_reverse_iterator(end()); } \
    R123_CUDA_DEVICE reverse_iterator rend(){ return reverse_iterator(begin()); } \
    R123_CUDA_DEVICE const_reverse_iterator rend() const{ return const_reverse_iterator(begin()); } \
    R123_CUDA_DEVICE const_reverse_iterator crbegin() const{ return const_reverse_iterator(cend()); } \
    R123_CUDA_DEVICE const_reverse_iterator crend() const{ return const_reverse_iterator(cbegin()); } \
    R123_CUDA_DEVICE pointer data(){ return &v[0]; } \
    R123_CUDA_DEVICE const_pointer data() const{ return &v[0]; } \
    R123_CUDA_DEVICE reference front(){ return v[0]; } \
    R123_CUDA_DEVICE const_reference front() const{ return v[0]; } \
    R123_CUDA_DEVICE reference back(){ return v[_N-1]; } \
    R123_CUDA_DEVICE const_reference back() const{ return v[_N-1]; } \
    R123_CUDA_DEVICE bool operator==(const r123array##_N##x##W& rhs) const{ \
        /* CUDA3 does not have std::equal */ \
        for (size_t i = 0; i < _N; ++i) \
            if (v[i] != rhs.v[i]) return false; \
        return true; \
    } \
    R123_CUDA_DEVICE bool operator!=(const r123array##_N##x##W& rhs) const{ return !(*this == rhs); } \
    /* CUDA3 does not have std::fill_n */ \
    R123_CUDA_DEVICE void fill(const value_type& val){ for (size_t i = 0; i < _N; ++i) v[i] = val; } \
    R123_CUDA_DEVICE void swap(r123array##_N##x##W& rhs){ \
        /* CUDA3 does not have std::swap_ranges */ \
        for (size_t i = 0; i < _N; ++i) { \
            T tmp = v[i]; \
            v[i] = rhs.v[i]; \
            rhs.v[i] = tmp; \
        } \
    } \
    R123_CUDA_DEVICE r123array##_N##x##W& incr(R123_ULONG_LONG n=1){ \
        /* This test is tricky because we're trying to avoid spurious \
           complaints about illegal shifts, yet still be compile-time \
           evaluated.
*/ \
        if(sizeof(T)<sizeof(n) && n>>((sizeof(T)<sizeof(n))?8*sizeof(T):0) ) \
            return incr_carefully(n); \
        if(n==1){ \
            ++v[0]; \
            if(_N==1 || R123_BUILTIN_EXPECT(!!v[0], 1)) return *this; \
        }else{ \
            v[0] += n; \
            if(_N==1 || R123_BUILTIN_EXPECT(n<=v[0], 1)) return *this; \
        } \
        /* carry into the higher elements; the v[_N>3?3:0] is to silence \
           a spurious error from icpc \
           */ \
        ++v[_N>1?1:0]; \
        if(_N==2 || R123_BUILTIN_EXPECT(!!v[_N>1?1:0], 1)) return *this; \
        ++v[_N>2?2:0]; \
        if(_N==3 || R123_BUILTIN_EXPECT(!!v[_N>2?2:0], 1)) return *this; \
        ++v[_N>3?3:0]; \
        for(size_t i=4; i<_N; ++i){ \
            if( R123_BUILTIN_EXPECT(!!v[i-1], 1) ) return *this; \
            ++v[i]; \
        } \
        return *this; \
    } \
    /* seed(SeedSeq) would be a constructor if having a constructor */ \
    /* didn't cause headaches with defaults */ \
    template <typename SeedSeq> \
    R123_CUDA_DEVICE static r123array##_N##x##W seed(SeedSeq &ss){ \
        r123array##_N##x##W ret; \
        const size_t Ngen = _N*((3+sizeof(value_type))/4); \
        uint32_t u32[Ngen]; \
        uint32_t *p32 = &u32[0]; \
        ss.generate(&u32[0], &u32[Ngen]); \
        for(size_t i=0; i<_N; ++i){ \
            ret.v[i] = assemble_from_u32<value_type>(p32); \
            p32 += (3+sizeof(value_type))/4; \
        } \
        return ret; \
    } \
    protected: \
    R123_CUDA_DEVICE r123array##_N##x##W& incr_carefully(R123_ULONG_LONG n){ \
        /* n may be greater than the maximum value of a single value_type */ \
        value_type vtn; \
        vtn = n; \
        v[0] += n; \
        const unsigned rshift = 8* ((sizeof(n)>sizeof(value_type))? sizeof(value_type) : 0); \
        for(size_t i=1; i<_N; ++i){ \
            if(rshift){ \
                n >>= rshift; \
            }else{ \
                n=0; \
            } \
            if( v[i-1] < vtn ) \
                ++n; \
            if( n==0 ) break; \
            vtn = n; \
            v[i] += n; \
        } \
        return *this; \
    } \

// There are several tricky considerations for the insertion and extraction
// operators:
// - we would like to be able to print r123array16x8 as a sequence of 16 integers,
//   not as 16 bytes.
// - we would like to be able to print r123array1xm128i.
// - we do not want an int conversion operator in r123m128i because it causes
//   lots of ambiguity problems with automatic promotions.
// Solution: r123arrayinsertable and r123arrayextractable

template <typename T>
struct r123arrayinsertable{
    const T& v;
    r123arrayinsertable(const T& t_) : v(t_) {}
    friend std::ostream& operator<<(std::ostream& os, const r123arrayinsertable<T>& t){
        return os << t.v;
    }
};

template<>
struct r123arrayinsertable<uint8_t>{
    const uint8_t& v;
    r123arrayinsertable(const uint8_t& t_) : v(t_) {}
    friend std::ostream& operator<<(std::ostream& os, const r123arrayinsertable<uint8_t>& t){
        return os << (int)t.v;
    }
};

template <typename T>
struct r123arrayextractable{
    T& v;
    r123arrayextractable(T& t_) : v(t_) {}
    friend std::istream& operator>>(std::istream& is, r123arrayextractable<T>& t){
        return is >> t.v;
    }
};

template<>
struct r123arrayextractable<uint8_t>{
    uint8_t& v;
    r123arrayextractable(uint8_t& t_) : v(t_) {}
    friend std::istream& operator>>(std::istream& is, r123arrayextractable<uint8_t>& t){
        int i;
        is >> i;
        t.v = i;
        return is;
    }
};

#define CXXOVERLOADS(_N, W, T) \
    \
    inline std::ostream& operator<<(std::ostream& os, const r123array##_N##x##W& a){ \
        os << r123arrayinsertable<T>(a.v[0]); \
        for(size_t i=1; i<_N; ++i) \
            os << " " << r123arrayinsertable<T>(a.v[i]); \
        return os; \
    } \
    \
    inline std::istream& operator>>(std::istream& is, r123array##_N##x##W& a){ \
        for(size_t i=0; i<_N; ++i){ \
            r123arrayextractable<T> x(a.v[i]); \
            is >> x; \
        } \
        return is; \
    } \
    \
    namespace r123{ \
        typedef r123array##_N##x##W Array##_N##x##W; \
    }

#endif /* __cplusplus */

/* _r123array_tpl expands to a declaration of struct r123arrayNxW.

   In C, it's nothing more than a struct containing an array of N
   objects of type T.

   In C++ it's the same, but endowed with an assortment of member
   functions, typedefs and friends.  In C++, r123arrayNxW looks a lot
   like std::array, has most of the capabilities of a container, and
   satisfies the requirements outlined in compat/Engine.hpp for counter
   and key types.
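   A small sketch of what the C expansion provides (an added, illustrative
   note): _r123array_tpl(2, 32, uint32_t) declares

       struct r123array2x32 { uint32_t v[2]; };

   so C code can use it as a plain value type, e.g.
   struct r123array2x32 ctr = {{0u, 0u}}; ++ctr.v[0];, while C++ code
   additionally gets the container-like members defined above.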
ArrayNxW, in the r123 namespace is a typedef equivalent to r123arrayNxW. */ #define _r123array_tpl(_N, W, T) \ /** @ingroup arrayNxW */ \ /** @see arrayNxW */ \ struct r123array##_N##x##W{ \ T v[_N]; \ CXXMETHODS(_N, W, T) \ }; \ \ CXXOVERLOADS(_N, W, T) /** @endcond */ _r123array_tpl(1, 32, uint32_t) /* r123array1x32 */ _r123array_tpl(2, 32, uint32_t) /* r123array2x32 */ _r123array_tpl(4, 32, uint32_t) /* r123array4x32 */ _r123array_tpl(8, 32, uint32_t) /* r123array8x32 */ _r123array_tpl(1, 64, uint64_t) /* r123array1x64 */ _r123array_tpl(2, 64, uint64_t) /* r123array2x64 */ _r123array_tpl(4, 64, uint64_t) /* r123array4x64 */ _r123array_tpl(16, 8, uint8_t) /* r123array16x8 for ARSsw, AESsw */ #if R123_USE_SSE _r123array_tpl(1, m128i, r123m128i) /* r123array1x128i for ARSni, AESni */ #endif /* In C++, it's natural to use sizeof(a::value_type), but in C it's pretty convoluted to figure out the width of the value_type of an r123arrayNxW: */ #define R123_W(a) (8*sizeof(((a *)0)->v[0])) /** @namespace r123 Most of the Random123 C++ API is contained in the r123 namespace. */ #endif pyopencl-2025.1/pyopencl/cl/pyopencl-random123/openclfeatures.h0000644000000000000000000000550114332717401021341 0ustar00/* Copyright 2010-2011, D. E. Shaw Research. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of D. E. Shaw Research nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #ifndef __openclfeatures_dot_hpp #define __openclfeatures_dot_hpp #ifndef R123_STATIC_INLINE #define R123_STATIC_INLINE inline #endif #ifndef R123_FORCE_INLINE #define R123_FORCE_INLINE(decl) decl __attribute__((always_inline)) #endif #ifndef R123_CUDA_DEVICE #define R123_CUDA_DEVICE #endif #ifndef R123_ASSERT #define R123_ASSERT(x) #endif #ifndef R123_BUILTIN_EXPECT #define R123_BUILTIN_EXPECT(expr,likely) expr #endif #ifndef R123_USE_GNU_UINT128 #define R123_USE_GNU_UINT128 0 #endif #ifndef R123_USE_MULHILO64_ASM #define R123_USE_MULHILO64_ASM 0 #endif #ifndef R123_USE_MULHILO64_MSVC_INTRIN #define R123_USE_MULHILO64_MSVC_INTRIN 0 #endif #ifndef R123_USE_MULHILO64_CUDA_INTRIN #define R123_USE_MULHILO64_CUDA_INTRIN 0 #endif #ifndef R123_USE_MULHILO64_OPENCL_INTRIN #ifdef PYOPENCL_USING_OCLGRIND #define R123_USE_MULHILO64_OPENCL_INTRIN 0 #else #define R123_USE_MULHILO64_OPENCL_INTRIN 1 #endif #endif #ifndef R123_USE_AES_NI #define R123_USE_AES_NI 0 #endif // XXX ATI APP SDK 2.4 clBuildProgram SEGVs if one uses uint64_t instead of // ulong to mul_hi. And gets lots of complaints from stdint.h // on some machines. // But these typedefs mean we cannot include stdint.h with // these headers? Do we need R123_64T, R123_32T, R123_8T? typedef ulong uint64_t; typedef uint uint32_t; typedef uchar uint8_t; #define UINT64_C(x) ((ulong)(x##UL)) #endif pyopencl-2025.1/pyopencl/cl/pyopencl-random123/philox.cl0000644000000000000000000005235414332717401020004 0ustar00/* Copyright 2010-2011, D. E. Shaw Research. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of D. E. Shaw Research nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #ifndef _philox_dot_h_ #define _philox_dot_h_ /** \cond HIDDEN_FROM_DOXYGEN */ #include "openclfeatures.h" #include "array.h" /* // Macros _Foo_tpl are code generation 'templates' They define // inline functions with names obtained by mangling Foo and the // macro arguments. 
E.g., // _mulhilo_tpl(32, uint32_t, uint64_t) // expands to a definition of: // mulhilo32(uint32_t, uint32_t, uint32_t *, uint32_t *) // We then 'instantiate the template' to define // several different functions, e.g., // mulhilo32 // mulhilo64 // These functions will be visible to user code, and may // also be used later in subsequent templates and definitions. // A template for mulhilo using a temporary of twice the word-width. // Gcc figures out that this can be reduced to a single 'mul' instruction, // despite the apparent use of double-wide variables, shifts, etc. It's // obviously not guaranteed that all compilers will be that smart, so // other implementations might be preferable, e.g., using an intrinsic // or an asm block. On the other hand, for 32-bit multiplies, // this *is* perfectly standard C99 - any C99 compiler should // understand it and produce correct code. For 64-bit multiplies, // it's only usable if the compiler recognizes that it can do // arithmetic on a 128-bit type. That happens to be true for gcc on // x86-64, and powerpc64 but not much else. */ #define _mulhilo_dword_tpl(W, Word, Dword) \ R123_CUDA_DEVICE R123_STATIC_INLINE Word mulhilo##W(Word a, Word b, Word* hip){ \ Dword product = ((Dword)a)*((Dword)b); \ *hip = product>>W; \ return (Word)product; \ } /* // A template for mulhilo using gnu-style asm syntax. // INSN can be "mulw", "mull" or "mulq". // FIXME - porting to other architectures, we'll need still-more conditional // branching here. Note that intrinsics are usually preferable. */ #ifdef __powerpc__ #define _mulhilo_asm_tpl(W, Word, INSN) \ R123_STATIC_INLINE Word mulhilo##W(Word ax, Word b, Word *hip){ \ Word dx = 0; \ __asm__("\n\t" \ INSN " %0,%1,%2\n\t" \ : "=r"(dx) \ : "r"(b), "r"(ax) \ ); \ *hip = dx; \ return ax*b; \ } #else #define _mulhilo_asm_tpl(W, Word, INSN) \ R123_STATIC_INLINE Word mulhilo##W(Word ax, Word b, Word *hip){ \ Word dx; \ __asm__("\n\t" \ INSN " %2\n\t" \ : "=a"(ax), "=d"(dx) \ : "r"(b), "0"(ax) \ ); \ *hip = dx; \ return ax; \ } #endif /* __powerpc__ */ /* // A template for mulhilo using MSVC-style intrinsics // For example,_umul128 is an msvc intrinsic, c.f. // http://msdn.microsoft.com/en-us/library/3dayytw9.aspx */ #define _mulhilo_msvc_intrin_tpl(W, Word, INTRIN) \ R123_STATIC_INLINE Word mulhilo##W(Word a, Word b, Word* hip){ \ return INTRIN(a, b, hip); \ } /* N.B. This really should be called _mulhilo_mulhi_intrin. It just happens that CUDA was the first time we used the idiom. */ #define _mulhilo_cuda_intrin_tpl(W, Word, INTRIN) \ R123_CUDA_DEVICE R123_STATIC_INLINE Word mulhilo##W(Word a, Word b, Word* hip){ \ *hip = INTRIN(a, b); \ return a*b; \ } /* // A template for mulhilo using only word-size operations and // C99 operators (no adc, no mulhi). It // requires four multiplies and a dozen or so shifts, adds // and tests. It's not clear what this is good for, other than // completeness. On 32-bit platforms, it could be used to // implement philoxNx64, but on such platforms both the philoxNx32 // and the threefryNx64 cbrngs are going to have much better // performance. It is enabled below by R123_USE_MULHILO64_C99, // but that is currently (Sep 2011) not set by any of the // features/XXfeatures.h headers. It can, of course, be // set with a compile-time -D option. 
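//
// As a quick arithmetic check of the hi/lo split computed by these
// templates (illustrative, not from the original sources): for W=32,
// a = b = 0x80000000 gives the 64-bit product 0x4000000000000000, so
// mulhilo32 must return lo = 0 and store hi = 0x40000000 in *hip.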
*/
#define _mulhilo_c99_tpl(W, Word) \
R123_STATIC_INLINE Word mulhilo##W(Word a, Word b, Word *hip){ \
    const unsigned WHALF = W/2; \
    const Word LOMASK = ((((Word)1)<<WHALF)-1); \
    Word lo = a*b;               /* full low multiply */ \
    Word ahi = a>>WHALF; \
    Word alo = a& LOMASK; \
    Word bhi = b>>WHALF; \
    Word blo = b& LOMASK; \
    \
    Word ahbl = ahi*blo; \
    Word albh = alo*bhi; \
    \
    Word ahbl_albh = ((ahbl&LOMASK) + (albh&LOMASK)); \
    Word hi = ahi*bhi + (ahbl>>WHALF) + (albh>>WHALF); \
    hi += ahbl_albh >> WHALF; /* carry from the sum of lo(ahbl) + lo(albh) ) */ \
    /* carry from the sum with alo*blo */ \
    hi += ((lo >> WHALF) < (ahbl_albh&LOMASK)); \
    *hip = hi; \
    return lo; \
}

/*
// A template for mulhilo on a platform that can't do it
// We could put a C version here, but is it better to run *VERY*
// slowly or to just stop and force the user to find another CBRNG?
*/
#define _mulhilo_fail_tpl(W, Word) \
R123_STATIC_INLINE Word mulhilo##W(Word a, Word b, Word *hip){ \
    R123_STATIC_ASSERT(0, "mulhilo" #W " is not implemented on this machine\n"); \
}

/*
// N.B. There's an MSVC intrinsic called _emul,
// which *might* compile into better code than
// _mulhilo_dword_tpl
*/
#if R123_USE_MULHILO32_ASM
#ifdef __powerpc__
_mulhilo_asm_tpl(32, uint32_t, "mulhwu")
#else
_mulhilo_asm_tpl(32, uint32_t, "mull")
#endif /* __powerpc__ */
#else
_mulhilo_dword_tpl(32, uint32_t, uint64_t)
#endif

#if R123_USE_PHILOX_64BIT
#if R123_USE_MULHILO64_ASM
#ifdef __powerpc64__
_mulhilo_asm_tpl(64, uint64_t, "mulhdu")
#else
_mulhilo_asm_tpl(64, uint64_t, "mulq")
#endif /* __powerpc64__ */
#elif R123_USE_MULHILO64_MSVC_INTRIN
_mulhilo_msvc_intrin_tpl(64, uint64_t, _umul128)
#elif R123_USE_MULHILO64_CUDA_INTRIN
_mulhilo_cuda_intrin_tpl(64, uint64_t, __umul64hi)
#elif R123_USE_MULHILO64_OPENCL_INTRIN
_mulhilo_cuda_intrin_tpl(64, uint64_t, mul_hi)
#elif R123_USE_MULHILO64_MULHI_INTRIN
_mulhilo_cuda_intrin_tpl(64, uint64_t, R123_MULHILO64_MULHI_INTRIN)
#elif R123_USE_GNU_UINT128
_mulhilo_dword_tpl(64, uint64_t, __uint128_t)
#elif R123_USE_MULHILO64_C99
_mulhilo_c99_tpl(64, uint64_t)
#else
_mulhilo_fail_tpl(64, uint64_t)
#endif
#endif

/*
// The multipliers and Weyl constants are "hard coded".
// To change them, you can #define them with different
// values before #include-ing this file.
// This isn't terribly elegant, but it works for C as
// well as C++.
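//
// For instance, defining (with a purely hypothetical replacement value)
//
//   #define PHILOX_M4x32_0 ((uint32_t)0xDEADBEEF)
//
// before this file is #include-d would substitute a different
// multiplier for the 4x32 generators.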
A nice C++-only solution would be to // use template parameters in the style of */ #ifndef PHILOX_M2x64_0 #define PHILOX_M2x64_0 R123_64BIT(0xD2B74407B1CE6E93) #endif #ifndef PHILOX_M4x64_0 #define PHILOX_M4x64_0 R123_64BIT(0xD2E7470EE14C6C93) #endif #ifndef PHILOX_M4x64_1 #define PHILOX_M4x64_1 R123_64BIT(0xCA5A826395121157) #endif #ifndef PHILOX_M2x32_0 #define PHILOX_M2x32_0 ((uint32_t)0xd256d193) #endif #ifndef PHILOX_M4x32_0 #define PHILOX_M4x32_0 ((uint32_t)0xD2511F53) #endif #ifndef PHILOX_M4x32_1 #define PHILOX_M4x32_1 ((uint32_t)0xCD9E8D57) #endif #ifndef PHILOX_W64_0 #define PHILOX_W64_0 R123_64BIT(0x9E3779B97F4A7C15) /* golden ratio */ #endif #ifndef PHILOX_W64_1 #define PHILOX_W64_1 R123_64BIT(0xBB67AE8584CAA73B) /* sqrt(3)-1 */ #endif #ifndef PHILOX_W32_0 #define PHILOX_W32_0 ((uint32_t)0x9E3779B9) #endif #ifndef PHILOX_W32_1 #define PHILOX_W32_1 ((uint32_t)0xBB67AE85) #endif #ifndef PHILOX2x32_DEFAULT_ROUNDS #define PHILOX2x32_DEFAULT_ROUNDS 10 #endif #ifndef PHILOX2x64_DEFAULT_ROUNDS #define PHILOX2x64_DEFAULT_ROUNDS 10 #endif #ifndef PHILOX4x32_DEFAULT_ROUNDS #define PHILOX4x32_DEFAULT_ROUNDS 10 #endif #ifndef PHILOX4x64_DEFAULT_ROUNDS #define PHILOX4x64_DEFAULT_ROUNDS 10 #endif /* The ignored fourth argument allows us to instantiate the same macro regardless of N. */ #define _philox2xWround_tpl(W, T) \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(struct r123array2x##W _philox2x##W##round(struct r123array2x##W ctr, struct r123array1x##W key)); \ R123_CUDA_DEVICE R123_STATIC_INLINE struct r123array2x##W _philox2x##W##round(struct r123array2x##W ctr, struct r123array1x##W key){ \ T hi; \ T lo = mulhilo##W(PHILOX_M2x##W##_0, ctr.v[0], &hi); \ struct r123array2x##W out = {{hi^key.v[0]^ctr.v[1], lo}}; \ return out; \ } #define _philox2xWbumpkey_tpl(W) \ R123_CUDA_DEVICE R123_STATIC_INLINE struct r123array1x##W _philox2x##W##bumpkey( struct r123array1x##W key) { \ key.v[0] += PHILOX_W##W##_0; \ return key; \ } #define _philox4xWround_tpl(W, T) \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(struct r123array4x##W _philox4x##W##round(struct r123array4x##W ctr, struct r123array2x##W key)); \ R123_CUDA_DEVICE R123_STATIC_INLINE struct r123array4x##W _philox4x##W##round(struct r123array4x##W ctr, struct r123array2x##W key){ \ T hi0; \ T hi1; \ T lo0 = mulhilo##W(PHILOX_M4x##W##_0, ctr.v[0], &hi0); \ T lo1 = mulhilo##W(PHILOX_M4x##W##_1, ctr.v[2], &hi1); \ struct r123array4x##W out = {{hi1^ctr.v[1]^key.v[0], lo1, \ hi0^ctr.v[3]^key.v[1], lo0}}; \ return out; \ } #define _philox4xWbumpkey_tpl(W) \ R123_CUDA_DEVICE R123_STATIC_INLINE struct r123array2x##W _philox4x##W##bumpkey( struct r123array2x##W key) { \ key.v[0] += PHILOX_W##W##_0; \ key.v[1] += PHILOX_W##W##_1; \ return key; \ } #define _philoxNxW_tpl(N, Nhalf, W, T) \ /** @ingroup PhiloxNxW */ \ enum r123_enum_philox##N##x##W { philox##N##x##W##_rounds = PHILOX##N##x##W##_DEFAULT_ROUNDS }; \ typedef struct r123array##N##x##W philox##N##x##W##_ctr_t; \ typedef struct r123array##Nhalf##x##W philox##N##x##W##_key_t; \ typedef struct r123array##Nhalf##x##W philox##N##x##W##_ukey_t; \ R123_CUDA_DEVICE R123_STATIC_INLINE philox##N##x##W##_key_t philox##N##x##W##keyinit(philox##N##x##W##_ukey_t uk) { return uk; } \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(philox##N##x##W##_ctr_t philox##N##x##W##_R(unsigned int R, philox##N##x##W##_ctr_t ctr, philox##N##x##W##_key_t key)); \ R123_CUDA_DEVICE R123_STATIC_INLINE philox##N##x##W##_ctr_t philox##N##x##W##_R(unsigned int R, philox##N##x##W##_ctr_t ctr, 
philox##N##x##W##_key_t key) { \
    R123_ASSERT(R<=16); \
    if(R>0){ ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>1){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>2){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>3){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>4){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>5){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>6){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>7){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>8){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>9){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>10){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>11){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>12){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>13){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>14){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    if(R>15){ key = _philox##N##x##W##bumpkey(key); ctr = _philox##N##x##W##round(ctr, key); } \
    return ctr; \
}

_philox2xWbumpkey_tpl(32)
_philox4xWbumpkey_tpl(32)
_philox2xWround_tpl(32, uint32_t)           /* philo2x32round */
_philox4xWround_tpl(32, uint32_t)           /* philo4x32round */
/** \endcond */
_philoxNxW_tpl(2, 1, 32, uint32_t)          /* philox2x32bijection */
_philoxNxW_tpl(4, 2, 32, uint32_t)          /* philox4x32bijection */
#if R123_USE_PHILOX_64BIT
/** \cond HIDDEN_FROM_DOXYGEN */
_philox2xWbumpkey_tpl(64)
_philox4xWbumpkey_tpl(64)
_philox2xWround_tpl(64, uint64_t)           /* philo2x64round */
_philox4xWround_tpl(64, uint64_t)           /* philo4x64round */
/** \endcond */
_philoxNxW_tpl(2, 1, 64, uint64_t)          /* philox2x64bijection */
_philoxNxW_tpl(4, 2, 64, uint64_t)          /* philox4x64bijection */
#endif /* R123_USE_PHILOX_64BIT */

#define philox2x32(c,k) philox2x32_R(philox2x32_rounds, c, k)
#define philox4x32(c,k) philox4x32_R(philox4x32_rounds, c, k)
#if R123_USE_PHILOX_64BIT
#define philox2x64(c,k) philox2x64_R(philox2x64_rounds, c, k)
#define philox4x64(c,k) philox4x64_R(philox4x64_rounds, c, k)
#endif /* R123_USE_PHILOX_64BIT */

#ifdef __cplusplus
#include
/** \cond HIDDEN_FROM_DOXYGEN */
#define _PhiloxNxW_base_tpl(CType, KType, N, W) \
namespace r123{ \
template <unsigned int ROUNDS> \
struct Philox##N##x##W##_R{ \
    typedef CType ctr_type; \
    typedef KType key_type; \
    typedef KType ukey_type; \
    static const unsigned int rounds=ROUNDS; \
    inline R123_CUDA_DEVICE R123_FORCE_INLINE(ctr_type operator()(ctr_type ctr, key_type key) const){ \
        R123_STATIC_ASSERT(ROUNDS<=16, "philox is only unrolled up to 16 rounds\n"); \
        return philox##N##x##W##_R(ROUNDS, ctr, key); \
    } \
}; \
typedef Philox##N##x##W##_R<philox##N##x##W##_rounds> Philox##N##x##W; \
} // namespace r123
/** \endcond */

_PhiloxNxW_base_tpl(r123array2x32, r123array1x32, 2, 32)    // Philox2x32_R
_PhiloxNxW_base_tpl(r123array4x32, r123array2x32, 4, 32)    // Philox4x32_R
#if R123_USE_PHILOX_64BIT
_PhiloxNxW_base_tpl(r123array2x64, r123array1x64, 2, 64)    // Philox2x64_R
_PhiloxNxW_base_tpl(r123array4x64, r123array2x64, 4, 64)    // Philox4x64_R
#endif

/* The _tpl macros don't quite work to do string-pasting inside comments,
   so we just write out the boilerplate documentation four times... */
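/* A minimal usage sketch of the C interface defined above (an added,
   illustrative note; "seed" stands for an arbitrary application-chosen
   32-bit value):

       philox4x32_key_t k = {{seed, 0}};
       philox4x32_ctr_t c = {{0, 0, 0, 0}};
       philox4x32_ctr_t r = philox4x32(c, k);

   Each distinct (c, k) pair deterministically yields an independent
   block of four 32-bit random words; incrementing the counter c (e.g.
   via the r123array incr() machinery) produces the next block. */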
So we just write out the boilerplate documentation four times... */
/** @defgroup PhiloxNxW Philox Classes and Typedefs
The PhiloxNxW classes export the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
As described in Parallel Random Numbers: As Easy as 1, 2, 3, the Philox family of counter-based RNGs uses integer multiplication, xor and permutation of W-bit words to scramble its N-word input key. Philox is a mnemonic for Product HI LO Xor.
@class r123::Philox2x32_R @ingroup PhiloxNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Philox round function will be applied. As of November 2011, the authors know of no statistical flaws with ROUNDS=6 or more for Philox2x32.
@typedef r123::Philox2x32 @ingroup PhiloxNxW
Philox2x32 is equivalent to Philox2x32_R<10>. With 10 rounds, Philox2x32 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Philox2x64_R @ingroup PhiloxNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Philox round function will be applied. As of September 2011, the authors know of no statistical flaws with ROUNDS=6 or more for Philox2x64.
@typedef r123::Philox2x64 @ingroup PhiloxNxW
Philox2x64 is equivalent to Philox2x64_R<10>. With 10 rounds, Philox2x64 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Philox4x32_R @ingroup PhiloxNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Philox round function will be applied. In November 2011, the authors recorded some suspicious p-values (approximately 1.e-7) from some very long (longer than the default BigCrush length) SimpPoker tests. Although even longer tests reverted to "passing" p-values, a cloud remains over Philox4x32 with 7 rounds. The authors know of no statistical flaws with ROUNDS=8 or more for Philox4x32.
@typedef r123::Philox4x32 @ingroup PhiloxNxW
Philox4x32 is equivalent to Philox4x32_R<10>. With 10 rounds, Philox4x32 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Philox4x64_R @ingroup PhiloxNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Philox round function will be applied. As of September 2011, the authors know of no statistical flaws with ROUNDS=7 or more for Philox4x64.
@typedef r123::Philox4x64 @ingroup PhiloxNxW
Philox4x64 is equivalent to Philox4x64_R<10>. With 10 rounds, Philox4x64 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
*/
#endif /* __cplusplus */
#endif /* _philox_dot_h_ */
pyopencl-2025.1/pyopencl/cl/pyopencl-random123/threefry.cl0000644000000000000000000015265414332717401020334 0ustar00/* Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of D. E. Shaw Research nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #ifndef _threefry_dot_h_ #define _threefry_dot_h_ #include "openclfeatures.h" #include "array.h" /** \cond HIDDEN_FROM_DOXYGEN */ /* Significant parts of this file were copied from from: Skein_FinalRnd/ReferenceImplementation/skein.h Skein_FinalRnd/ReferenceImplementation/skein_block.c in http://csrc.nist.gov/groups/ST/hash/sha-3/Round3/documents/Skein_FinalRnd.zip This file has been modified so that it may no longer perform its originally intended function. If you're looking for a Skein or Threefish source code, please consult the original file. The original file had the following header: ************************************************************************** ** ** Interface declarations and internal definitions for Skein hashing. ** ** Source code author: Doug Whiting, 2008. ** ** This algorithm and source code is released to the public domain. ** *************************************************************************** */ /* See comment at the top of philox.h for the macro pre-process strategy. */ /* Rotation constants: */ enum r123_enum_threefry64x4 { /* These are the R_256 constants from the Threefish reference sources with names changed to R_64x4... */ R_64x4_0_0=14, R_64x4_0_1=16, R_64x4_1_0=52, R_64x4_1_1=57, R_64x4_2_0=23, R_64x4_2_1=40, R_64x4_3_0= 5, R_64x4_3_1=37, R_64x4_4_0=25, R_64x4_4_1=33, R_64x4_5_0=46, R_64x4_5_1=12, R_64x4_6_0=58, R_64x4_6_1=22, R_64x4_7_0=32, R_64x4_7_1=32 }; enum r123_enum_threefry64x2 { /* // Output from skein_rot_search: (srs64_B64-X1000) // Random seed = 1. BlockSize = 128 bits. sampleCnt = 1024. rounds = 8, minHW_or=57 // Start: Tue Mar 1 10:07:48 2011 // rMin = 0.136. #0325[*15] [CRC=455A682F. hw_OR=64. cnt=16384. 
blkSize= 128].format */ R_64x2_0_0=16, R_64x2_1_0=42, R_64x2_2_0=12, R_64x2_3_0=31, R_64x2_4_0=16, R_64x2_5_0=32, R_64x2_6_0=24, R_64x2_7_0=21 /* 4 rounds: minHW = 4 [ 4 4 4 4 ] // 5 rounds: minHW = 8 [ 8 8 8 8 ] // 6 rounds: minHW = 16 [ 16 16 16 16 ] // 7 rounds: minHW = 32 [ 32 32 32 32 ] // 8 rounds: minHW = 64 [ 64 64 64 64 ] // 9 rounds: minHW = 64 [ 64 64 64 64 ] //10 rounds: minHW = 64 [ 64 64 64 64 ] //11 rounds: minHW = 64 [ 64 64 64 64 ] */ }; enum r123_enum_threefry32x4 { /* Output from skein_rot_search: (srs-B128-X5000.out) // Random seed = 1. BlockSize = 64 bits. sampleCnt = 1024. rounds = 8, minHW_or=28 // Start: Mon Aug 24 22:41:36 2009 // ... // rMin = 0.472. #0A4B[*33] [CRC=DD1ECE0F. hw_OR=31. cnt=16384. blkSize= 128].format */ R_32x4_0_0=10, R_32x4_0_1=26, R_32x4_1_0=11, R_32x4_1_1=21, R_32x4_2_0=13, R_32x4_2_1=27, R_32x4_3_0=23, R_32x4_3_1= 5, R_32x4_4_0= 6, R_32x4_4_1=20, R_32x4_5_0=17, R_32x4_5_1=11, R_32x4_6_0=25, R_32x4_6_1=10, R_32x4_7_0=18, R_32x4_7_1=20 /* 4 rounds: minHW = 3 [ 3 3 3 3 ] // 5 rounds: minHW = 7 [ 7 7 7 7 ] // 6 rounds: minHW = 12 [ 13 12 13 12 ] // 7 rounds: minHW = 22 [ 22 23 22 23 ] // 8 rounds: minHW = 31 [ 31 31 31 31 ] // 9 rounds: minHW = 32 [ 32 32 32 32 ] //10 rounds: minHW = 32 [ 32 32 32 32 ] //11 rounds: minHW = 32 [ 32 32 32 32 ] */ }; enum r123_enum_threefry32x2 { /* Output from skein_rot_search (srs32x2-X5000.out) // Random seed = 1. BlockSize = 64 bits. sampleCnt = 1024. rounds = 8, minHW_or=28 // Start: Tue Jul 12 11:11:33 2011 // rMin = 0.334. #0206[*07] [CRC=1D9765C0. hw_OR=32. cnt=16384. blkSize= 64].format */ R_32x2_0_0=13, R_32x2_1_0=15, R_32x2_2_0=26, R_32x2_3_0= 6, R_32x2_4_0=17, R_32x2_5_0=29, R_32x2_6_0=16, R_32x2_7_0=24 /* 4 rounds: minHW = 4 [ 4 4 4 4 ] // 5 rounds: minHW = 6 [ 6 8 6 8 ] // 6 rounds: minHW = 9 [ 9 12 9 12 ] // 7 rounds: minHW = 16 [ 16 24 16 24 ] // 8 rounds: minHW = 32 [ 32 32 32 32 ] // 9 rounds: minHW = 32 [ 32 32 32 32 ] //10 rounds: minHW = 32 [ 32 32 32 32 ] //11 rounds: minHW = 32 [ 32 32 32 32 ] */ }; enum r123_enum_threefry_wcnt { WCNT2=2, WCNT4=4 }; R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(uint64_t RotL_64(uint64_t x, unsigned int N)); R123_CUDA_DEVICE R123_STATIC_INLINE uint64_t RotL_64(uint64_t x, unsigned int N) { return (x << (N & 63)) | (x >> ((64-N) & 63)); } R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(uint32_t RotL_32(uint32_t x, unsigned int N)); R123_CUDA_DEVICE R123_STATIC_INLINE uint32_t RotL_32(uint32_t x, unsigned int N) { return (x << (N & 31)) | (x >> ((32-N) & 31)); } #define SKEIN_MK_64(hi32,lo32) ((lo32) + (((uint64_t) (hi32)) << 32)) #define SKEIN_KS_PARITY64 SKEIN_MK_64(0x1BD11BDA,0xA9FC1A22) #define SKEIN_KS_PARITY32 0x1BD11BDA #ifndef THREEFRY2x32_DEFAULT_ROUNDS #define THREEFRY2x32_DEFAULT_ROUNDS 20 #endif #ifndef THREEFRY2x64_DEFAULT_ROUNDS #define THREEFRY2x64_DEFAULT_ROUNDS 20 #endif #ifndef THREEFRY4x32_DEFAULT_ROUNDS #define THREEFRY4x32_DEFAULT_ROUNDS 20 #endif #ifndef THREEFRY4x64_DEFAULT_ROUNDS #define THREEFRY4x64_DEFAULT_ROUNDS 20 #endif #define _threefry2x_tpl(W) \ typedef struct r123array2x##W threefry2x##W##_ctr_t; \ typedef struct r123array2x##W threefry2x##W##_key_t; \ typedef struct r123array2x##W threefry2x##W##_ukey_t; \ R123_CUDA_DEVICE R123_STATIC_INLINE threefry2x##W##_key_t threefry2x##W##keyinit(threefry2x##W##_ukey_t uk) { return uk; } \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(threefry2x##W##_ctr_t threefry2x##W##_R(unsigned int Nrounds, threefry2x##W##_ctr_t in, threefry2x##W##_key_t k)); \ R123_CUDA_DEVICE 
R123_STATIC_INLINE \ threefry2x##W##_ctr_t threefry2x##W##_R(unsigned int Nrounds, threefry2x##W##_ctr_t in, threefry2x##W##_key_t k){ \ threefry2x##W##_ctr_t X; \ uint##W##_t ks[2+1]; \ int i; /* avoid size_t to avoid need for stddef.h */ \ R123_ASSERT(Nrounds<=32); \ ks[2] = SKEIN_KS_PARITY##W; \ for (i=0;i < 2; i++) \ { \ ks[i] = k.v[i]; \ X.v[i] = in.v[i]; \ ks[2] ^= k.v[i]; \ } \ \ /* Insert initial key before round 0 */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; \ \ if(Nrounds>0){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_0_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>1){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_1_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>2){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_2_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>3){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_3_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>3){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; \ X.v[1] += 1; /* X.v[2-1] += r */ \ } \ if(Nrounds>4){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_4_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>5){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_5_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>6){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_6_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>7){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_7_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>7){ \ /* InjectKey(r=2) */ \ X.v[0] += ks[2]; X.v[1] += ks[0]; \ X.v[1] += 2; \ } \ if(Nrounds>8){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_0_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>9){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_1_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>10){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_2_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>11){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_3_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>11){ \ /* InjectKey(r=3) */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; \ X.v[1] += 3; \ } \ if(Nrounds>12){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_4_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>13){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_5_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>14){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_6_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>15){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_7_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>15){ \ /* InjectKey(r=4) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; \ X.v[1] += 4; \ } \ if(Nrounds>16){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_0_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>17){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_1_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>18){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_2_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>19){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_3_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>19){ \ /* InjectKey(r=5) */ \ X.v[0] += ks[2]; X.v[1] += ks[0]; \ X.v[1] += 5; \ } \ if(Nrounds>20){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_4_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>21){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_5_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>22){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_6_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>23){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_7_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>23){ \ /* InjectKey(r=6) */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; \ X.v[1] += 6; \ } \ if(Nrounds>24){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_0_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>25){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_1_0); 
X.v[1] ^= X.v[0]; } \ if(Nrounds>26){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_2_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>27){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_3_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>27){ \ /* InjectKey(r=7) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; \ X.v[1] += 7; \ } \ if(Nrounds>28){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_4_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>29){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_5_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>30){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_6_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>31){ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x2_7_0); X.v[1] ^= X.v[0]; } \ if(Nrounds>31){ \ /* InjectKey(r=8) */ \ X.v[0] += ks[2]; X.v[1] += ks[0]; \ X.v[1] += 8; \ } \ return X; \ } \ /** @ingroup ThreefryNxW */ \ enum r123_enum_threefry2x##W { threefry2x##W##_rounds = THREEFRY2x##W##_DEFAULT_ROUNDS }; \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(threefry2x##W##_ctr_t threefry2x##W(threefry2x##W##_ctr_t in, threefry2x##W##_key_t k)); \ R123_CUDA_DEVICE R123_STATIC_INLINE \ threefry2x##W##_ctr_t threefry2x##W(threefry2x##W##_ctr_t in, threefry2x##W##_key_t k){ \ return threefry2x##W##_R(threefry2x##W##_rounds, in, k); \ } #define _threefry4x_tpl(W) \ typedef struct r123array4x##W threefry4x##W##_ctr_t; \ typedef struct r123array4x##W threefry4x##W##_key_t; \ typedef struct r123array4x##W threefry4x##W##_ukey_t; \ R123_CUDA_DEVICE R123_STATIC_INLINE threefry4x##W##_key_t threefry4x##W##keyinit(threefry4x##W##_ukey_t uk) { return uk; } \ R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(threefry4x##W##_ctr_t threefry4x##W##_R(unsigned int Nrounds, threefry4x##W##_ctr_t in, threefry4x##W##_key_t k)); \ R123_CUDA_DEVICE R123_STATIC_INLINE \ threefry4x##W##_ctr_t threefry4x##W##_R(unsigned int Nrounds, threefry4x##W##_ctr_t in, threefry4x##W##_key_t k){ \ threefry4x##W##_ctr_t X; \ uint##W##_t ks[4+1]; \ int i; /* avoid size_t to avoid need for stddef.h */ \ R123_ASSERT(Nrounds<=72); \ ks[4] = SKEIN_KS_PARITY##W; \ for (i=0;i < 4; i++) \ { \ ks[i] = k.v[i]; \ X.v[i] = in.v[i]; \ ks[4] ^= k.v[i]; \ } \ \ /* Insert initial key before round 0 */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; X.v[2] += ks[2]; X.v[3] += ks[3]; \ \ if(Nrounds>0){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>1){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>2){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>3){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>3){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; X.v[2] += ks[3]; X.v[3] += ks[4]; \ X.v[4-1] += 1; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>4){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>5){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>6){ \ X.v[0] += X.v[1]; X.v[1] = 
RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>7){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>7){ \ /* InjectKey(r=2) */ \ X.v[0] += ks[2]; X.v[1] += ks[3]; X.v[2] += ks[4]; X.v[3] += ks[0]; \ X.v[4-1] += 2; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>8){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>9){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>10){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>11){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>11){ \ /* InjectKey(r=3) */ \ X.v[0] += ks[3]; X.v[1] += ks[4]; X.v[2] += ks[0]; X.v[3] += ks[1]; \ X.v[4-1] += 3; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>12){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>13){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>14){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>15){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>15){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[4]; X.v[1] += ks[0]; X.v[2] += ks[1]; X.v[3] += ks[2]; \ X.v[4-1] += 4; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>16){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>17){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>18){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>19){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>19){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; X.v[2] += ks[2]; X.v[3] += ks[3]; \ X.v[4-1] += 5; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>20){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>21){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>22){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= 
X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>23){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>23){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; X.v[2] += ks[3]; X.v[3] += ks[4]; \ X.v[4-1] += 6; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>24){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>25){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>26){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>27){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>27){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[2]; X.v[1] += ks[3]; X.v[2] += ks[4]; X.v[3] += ks[0]; \ X.v[4-1] += 7; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>28){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>29){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>30){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>31){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>31){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[3]; X.v[1] += ks[4]; X.v[2] += ks[0]; X.v[3] += ks[1]; \ X.v[4-1] += 8; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>32){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>33){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>34){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>35){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>35){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[4]; X.v[1] += ks[0]; X.v[2] += ks[1]; X.v[3] += ks[2]; \ X.v[4-1] += 9; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>36){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>37){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>38){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = 
RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>39){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>39){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; X.v[2] += ks[2]; X.v[3] += ks[3]; \ X.v[4-1] += 10; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>40){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>41){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>42){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>43){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>43){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; X.v[2] += ks[3]; X.v[3] += ks[4]; \ X.v[4-1] += 11; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>44){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>45){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>46){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>47){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>47){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[2]; X.v[1] += ks[3]; X.v[2] += ks[4]; X.v[3] += ks[0]; \ X.v[4-1] += 12; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>48){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>49){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>50){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>51){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>51){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[3]; X.v[1] += ks[4]; X.v[2] += ks[0]; X.v[3] += ks[1]; \ X.v[4-1] += 13; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>52){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>53){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>54){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] 
^= X.v[2]; \ } \ if(Nrounds>55){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>55){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[4]; X.v[1] += ks[0]; X.v[2] += ks[1]; X.v[3] += ks[2]; \ X.v[4-1] += 14; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>56){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>57){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>58){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>59){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>59){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[0]; X.v[1] += ks[1]; X.v[2] += ks[2]; X.v[3] += ks[3]; \ X.v[4-1] += 15; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>60){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>61){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>62){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>63){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>63){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[1]; X.v[1] += ks[2]; X.v[2] += ks[3]; X.v[3] += ks[4]; \ X.v[4-1] += 16; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>64){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_0_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_0_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>65){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_1_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_1_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>66){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_2_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_2_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>67){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_3_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_3_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>67){ \ /* InjectKey(r=1) */ \ X.v[0] += ks[2]; X.v[1] += ks[3]; X.v[2] += ks[4]; X.v[3] += ks[0]; \ X.v[4-1] += 17; /* X.v[WCNT4-1] += r */ \ } \ \ if(Nrounds>68){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_4_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_4_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>69){ \ X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_5_0); X.v[3] ^= X.v[0]; \ X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_5_1); X.v[1] ^= X.v[2]; \ } \ if(Nrounds>70){ \ X.v[0] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_6_0); X.v[1] ^= X.v[0]; \ X.v[2] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_6_1); X.v[3] ^= X.v[2]; \ } \ if(Nrounds>71){ \ 
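/* last unrolled round: each mix step adds one word into its partner, rotates the partner left by a round-dependent constant, and xors it back into the sum */ \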
X.v[0] += X.v[3]; X.v[3] = RotL_##W(X.v[3],R_##W##x4_7_0); X.v[3] ^= X.v[0]; \
X.v[2] += X.v[1]; X.v[1] = RotL_##W(X.v[1],R_##W##x4_7_1); X.v[1] ^= X.v[2]; \
} \
if(Nrounds>71){ \
/* InjectKey(r=18) */ \
X.v[0] += ks[3]; X.v[1] += ks[4]; X.v[2] += ks[0]; X.v[3] += ks[1]; \
X.v[4-1] += 18; /* X.v[WCNT4-1] += r */ \
} \
\
return X; \
} \
/** @ingroup ThreefryNxW */ \
enum r123_enum_threefry4x##W { threefry4x##W##_rounds = THREEFRY4x##W##_DEFAULT_ROUNDS }; \
R123_CUDA_DEVICE R123_STATIC_INLINE R123_FORCE_INLINE(threefry4x##W##_ctr_t threefry4x##W(threefry4x##W##_ctr_t in, threefry4x##W##_key_t k)); \
R123_CUDA_DEVICE R123_STATIC_INLINE \
threefry4x##W##_ctr_t threefry4x##W(threefry4x##W##_ctr_t in, threefry4x##W##_key_t k){ \
return threefry4x##W##_R(threefry4x##W##_rounds, in, k); \
}
/** \endcond */
_threefry2x_tpl(64) _threefry2x_tpl(32) _threefry4x_tpl(64) _threefry4x_tpl(32)
/* gcc4.5 and 4.6 seem to optimize a macro-ized threefryNxW better than a static inline function. Why? */
#define threefry2x32(c,k) threefry2x32_R(threefry2x32_rounds, c, k)
#define threefry4x32(c,k) threefry4x32_R(threefry4x32_rounds, c, k)
#define threefry2x64(c,k) threefry2x64_R(threefry2x64_rounds, c, k)
#define threefry4x64(c,k) threefry4x64_R(threefry4x64_rounds, c, k)
#ifdef __cplusplus
/** \cond HIDDEN_FROM_DOXYGEN */
#define _threefryNxWclass_tpl(NxW) \
namespace r123{ \
template <unsigned int R> \
struct Threefry##NxW##_R{ \
typedef threefry##NxW##_ctr_t ctr_type; \
typedef threefry##NxW##_key_t key_type; \
typedef threefry##NxW##_key_t ukey_type; \
static const unsigned int rounds=R; \
inline R123_CUDA_DEVICE R123_FORCE_INLINE(ctr_type operator()(ctr_type ctr, key_type key)){ \
R123_STATIC_ASSERT(R<=72, "threefry is only unrolled up to 72 rounds\n"); \
return threefry##NxW##_R(R, ctr, key); \
} \
}; \
typedef Threefry##NxW##_R<THREEFRY##NxW##_DEFAULT_ROUNDS> Threefry##NxW; \
} // namespace r123
/** \endcond */
_threefryNxWclass_tpl(2x32) _threefryNxWclass_tpl(4x32) _threefryNxWclass_tpl(2x64) _threefryNxWclass_tpl(4x64)
/* The _tpl macros don't quite work to do string-pasting inside comments, so we just write out the boilerplate documentation four times... */
/** @defgroup ThreefryNxW Threefry Classes and Typedefs
The ThreefryNxW classes export the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
As described in Parallel Random Numbers: As Easy as 1, 2, 3, the Threefry family is closely related to the Threefish block cipher from the Skein Hash Function. Threefry is \b not suitable for cryptographic use. Threefry uses integer addition, bitwise rotation, xor and permutation of words to randomize its output.
@class r123::Threefry2x32_R @ingroup ThreefryNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Threefry round function will be applied. As of September 2011, the authors know of no statistical flaws with ROUNDS=13 or more for Threefry2x32.
@typedef r123::Threefry2x32 @ingroup ThreefryNxW
Threefry2x32 is equivalent to Threefry2x32_R<20>. With 20 rounds, Threefry2x32 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Threefry2x64_R @ingroup ThreefryNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Threefry round function will be applied.
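A short counter-mode usage sketch (variable names are illustrative, not part of the library):
@code
typedef r123::Threefry2x64 G;   // equivalent to Threefry2x64_R<20>
G rng;
G::key_type k = {{42, 0}};      // e.g. a seed and a stream id
G::ctr_type c = {{0, 0}};
c.v[0] = 7;                     // ask for the 7th 128-bit block of the stream
G::ctr_type r = rng(c, k);      // r.v[0], r.v[1]: two uniform 64-bit words
@endcode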
In November 2011, the authors discovered that 13 rounds of Threefry2x64 sequenced by strided, interleaved key and counter increments failed a very long (longer than the default BigCrush length) WeightDistrib test. At the same time, it was confirmed that 14 rounds passed much longer tests (up to 5x10^12 samples) of a similar nature. The authors know of no statistical flaws with ROUNDS=14 or more for Threefry2x64.
@typedef r123::Threefry2x64 @ingroup ThreefryNxW
Threefry2x64 is equivalent to Threefry2x64_R<20>. With 20 rounds, Threefry2x64 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Threefry4x32_R @ingroup ThreefryNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Threefry round function will be applied. As of September 2011, the authors know of no statistical flaws with ROUNDS=12 or more for Threefry4x32.
@typedef r123::Threefry4x32 @ingroup ThreefryNxW
Threefry4x32 is equivalent to Threefry4x32_R<20>. With 20 rounds, Threefry4x32 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
@class r123::Threefry4x64_R @ingroup ThreefryNxW
exports the member functions, typedefs and operator overloads required by a @ref CBRNG "CBRNG" class.
The template argument, ROUNDS, is the number of times the Threefry round function will be applied. As of September 2011, the authors know of no statistical flaws with ROUNDS=12 or more for Threefry4x64.
@typedef r123::Threefry4x64 @ingroup ThreefryNxW
Threefry4x64 is equivalent to Threefry4x64_R<20>. With 20 rounds, Threefry4x64 has a considerable safety margin over the minimum number of rounds with no known statistical flaws, but still has excellent performance.
*/
#endif
#endif
pyopencl-2025.1/pyopencl/clmath.py0000644000000000000000000002003614332717401014040 0ustar00# pylint:disable=unexpected-keyword-arg  # for @elwise_kernel_runner
__copyright__ = "Copyright (C) 2009 Andreas Kloeckner"
__license__ = """
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
""" import numpy as np import pyopencl.array as cl_array import pyopencl.elementwise as elementwise from pyopencl.array import _get_common_dtype def _make_unary_array_func(name): @cl_array.elwise_kernel_runner def knl_runner(result, arg): if arg.dtype.kind == "c": from pyopencl.elementwise import complex_dtype_to_name fname = "{}_{}".format(complex_dtype_to_name(arg.dtype), name) else: fname = name return elementwise.get_unary_func_kernel( result.context, fname, arg.dtype) def f(array, queue=None): result = array._new_like_me(queue=queue) event1 = knl_runner(result, array, queue=queue) result.add_event(event1) return result return f # See table 6.8 in the CL 1.1 spec acos = _make_unary_array_func("acos") acosh = _make_unary_array_func("acosh") acospi = _make_unary_array_func("acospi") asin = _make_unary_array_func("asin") asinh = _make_unary_array_func("asinh") asinpi = _make_unary_array_func("asinpi") @cl_array.elwise_kernel_runner def _atan2(result, arg1, arg2): return elementwise.get_float_binary_func_kernel( result.context, "atan2", arg1.dtype, arg2.dtype, result.dtype) @cl_array.elwise_kernel_runner def _atan2pi(result, arg1, arg2): return elementwise.get_float_binary_func_kernel( result.context, "atan2pi", arg1.dtype, arg2.dtype, result.dtype) atan = _make_unary_array_func("atan") def atan2(y, x, queue=None): """ .. versionadded:: 2013.1 """ queue = queue or y.queue result = y._new_like_me(_get_common_dtype(y, x, queue)) result.add_event(_atan2(result, y, x, queue=queue)) return result atanh = _make_unary_array_func("atanh") atanpi = _make_unary_array_func("atanpi") def atan2pi(y, x, queue=None): """ .. versionadded:: 2013.1 """ queue = queue or y.queue result = y._new_like_me(_get_common_dtype(y, x, queue)) result.add_event(_atan2pi(result, y, x, queue=queue)) return result cbrt = _make_unary_array_func("cbrt") ceil = _make_unary_array_func("ceil") # TODO: copysign cos = _make_unary_array_func("cos") cosh = _make_unary_array_func("cosh") cospi = _make_unary_array_func("cospi") erfc = _make_unary_array_func("erfc") erf = _make_unary_array_func("erf") exp = _make_unary_array_func("exp") exp2 = _make_unary_array_func("exp2") exp10 = _make_unary_array_func("exp10") expm1 = _make_unary_array_func("expm1") fabs = _make_unary_array_func("fabs") # TODO: fdim floor = _make_unary_array_func("floor") # TODO: fma # TODO: fmax # TODO: fmin @cl_array.elwise_kernel_runner def _fmod(result, arg, mod): return elementwise.get_fmod_kernel(result.context, result.dtype, arg.dtype, mod.dtype) def fmod(arg, mod, queue=None): """Return the floating point remainder of the division ``arg / mod``, for each element in ``arg`` and ``mod``.""" queue = (queue or arg.queue) or mod.queue result = arg._new_like_me(_get_common_dtype(arg, mod, queue)) result.add_event(_fmod(result, arg, mod, queue=queue)) return result # TODO: fract @cl_array.elwise_kernel_runner def _frexp(sig, expt, arg): return elementwise.get_frexp_kernel(sig.context, sig.dtype, expt.dtype, arg.dtype) def frexp(arg, queue=None): """Return a tuple ``(significands, exponents)`` such that ``arg == significand * 2**exponent``. 
""" sig = arg._new_like_me(queue=queue) expt = arg._new_like_me(queue=queue, dtype=np.int32) event1 = _frexp(sig, expt, arg, queue=queue) sig.add_event(event1) expt.add_event(event1) return sig, expt # TODO: hypot ilogb = _make_unary_array_func("ilogb") @cl_array.elwise_kernel_runner def _ldexp(result, sig, exp): return elementwise.get_ldexp_kernel(result.context, result.dtype, sig.dtype, exp.dtype) def ldexp(significand, exponent, queue=None): """Return a new array of floating point values composed from the entries of ``significand`` and ``exponent``, paired together as ``result = significand * 2**exponent``. """ result = significand._new_like_me(queue=queue) result.add_event(_ldexp(result, significand, exponent)) return result lgamma = _make_unary_array_func("lgamma") # TODO: lgamma_r log = _make_unary_array_func("log") log2 = _make_unary_array_func("log2") log10 = _make_unary_array_func("log10") log1p = _make_unary_array_func("log1p") logb = _make_unary_array_func("logb") # TODO: mad # TODO: maxmag # TODO: minmag @cl_array.elwise_kernel_runner def _modf(intpart, fracpart, arg): return elementwise.get_modf_kernel(intpart.context, intpart.dtype, fracpart.dtype, arg.dtype) def modf(arg, queue=None): """Return a tuple ``(fracpart, intpart)`` of arrays containing the integer and fractional parts of ``arg``. """ intpart = arg._new_like_me(queue=queue) fracpart = arg._new_like_me(queue=queue) event1 = _modf(intpart, fracpart, arg, queue=queue) fracpart.add_event(event1) intpart.add_event(event1) return fracpart, intpart nan = _make_unary_array_func("nan") # TODO: nextafter # TODO: remainder # TODO: remquo rint = _make_unary_array_func("rint") # TODO: rootn round = _make_unary_array_func("round") sin = _make_unary_array_func("sin") # TODO: sincos sinh = _make_unary_array_func("sinh") sinpi = _make_unary_array_func("sinpi") sqrt = _make_unary_array_func("sqrt") tan = _make_unary_array_func("tan") tanh = _make_unary_array_func("tanh") tanpi = _make_unary_array_func("tanpi") tgamma = _make_unary_array_func("tgamma") trunc = _make_unary_array_func("trunc") # no point wrapping half_ or native_ # TODO: table 6.10, integer functions # TODO: table 6.12, clamp et al @cl_array.elwise_kernel_runner def _bessel_jn(result, n, x): return elementwise.get_bessel_kernel(result.context, "j", result.dtype, np.dtype(type(n)), x.dtype) @cl_array.elwise_kernel_runner def _bessel_yn(result, n, x): return elementwise.get_bessel_kernel(result.context, "y", result.dtype, np.dtype(type(n)), x.dtype) @cl_array.elwise_kernel_runner def _hankel_01(h0, h1, x): if h0.dtype != h1.dtype: raise TypeError("types of h0 and h1 must match") return elementwise.get_hankel_01_kernel( h0.context, h0.dtype, x.dtype) def bessel_jn(n, x, queue=None): result = x._new_like_me(queue=queue) result.add_event(_bessel_jn(result, n, x, queue=queue)) return result def bessel_yn(n, x, queue=None): result = x._new_like_me(queue=queue) result.add_event(_bessel_yn(result, n, x, queue=queue)) return result def hankel_01(x, queue=None): h0 = x._new_like_me(queue=queue) h1 = x._new_like_me(queue=queue) event1 = _hankel_01(h0, h1, x, queue=queue) h0.add_event(event1) h1.add_event(event1) return h0, h1 pyopencl-2025.1/pyopencl/clrandom.py0000644000000000000000000003136414332717401014375 0ustar00__copyright__ = "Copyright (C) 2009-16 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ # {{{ documentation __doc__ = """ PyOpenCL includes and uses some of the `Random123 random number generators `__ by D.E. Shaw Research. In addition to being usable through the convenience functions above, they are available in any piece of code compiled through PyOpenCL by:: #include #include See the `Philox source `__ and the `Threefry source `__ for some documentation if you're planning on using Random123 directly. .. autoclass:: PhiloxGenerator .. autoclass:: ThreefryGenerator .. autofunction:: rand .. autofunction:: fill_rand """ # }}} import numpy as np from pytools import memoize_method import pyopencl as cl import pyopencl.array as cl_array import pyopencl.cltypes as cltypes from pyopencl.tools import first_arg_dependent_memoize # {{{ Random123 generators class Random123GeneratorBase: """ .. versionadded:: 2016.2 .. automethod:: fill_uniform .. automethod:: uniform .. automethod:: fill_normal .. automethod:: normal """ @property def header_name(self): raise NotImplementedError @property def generator_name(self): raise NotImplementedError @property def key_length(self): raise NotImplementedError def __init__(self, context, key=None, counter=None, seed=None): int32_info = np.iinfo(np.int32) from random import Random rng = Random(seed) if key is not None and counter is not None and seed is not None: raise TypeError("seed is unused and may not be specified " "if both counter and key are given") if key is None: key = [ rng.randrange( int(int32_info.min), int(int32_info.max)+1) for i in range(self.key_length-1)] if counter is None: counter = [ rng.randrange( int(int32_info.min), int(int32_info.max)+1) for i in range(4)] self.context = context self.key = key self.counter = counter self.counter_max = int32_info.max @memoize_method def get_gen_kernel(self, dtype, distribution): size_multiplier = 1 arg_dtype = dtype rng_key = (distribution, dtype) if rng_key in [("uniform", np.float64), ("normal", np.float64)]: c_type = "double" scale1_const = "((double) %r)" % (1/2**32) scale2_const = "((double) %r)" % (1/2**64) if distribution == "normal": transform = "box_muller" else: transform = "" rng_expr = ( "shift + scale * " "%s( %s * convert_double4(gen)" "+ %s * convert_double4(gen))" % (transform, scale1_const, scale2_const)) counter_multiplier = 2 elif rng_key in [(dist, cmp_dtype) for dist in ["normal", "uniform"] for cmp_dtype in [ np.float32, cltypes.float2, cltypes.float3, cltypes.float4, ]]: c_type = "float" scale_const = "((float) %r)" % (1/2**32) if distribution == "normal": transform = "box_muller" else: transform = "" rng_expr = ( "shift + scale * %s(%s * convert_float4(gen))" % (transform, scale_const)) counter_multiplier = 1 arg_dtype = np.float32 try: _, 
size_multiplier = cltypes.vec_type_to_scalar_and_count[dtype] except KeyError: pass elif rng_key == ("uniform", np.int32): c_type = "int" rng_expr = ( "shift + convert_int4((convert_long4(gen) * scale) / %s)" % (str(2**32)+"l") ) counter_multiplier = 1 elif rng_key == ("uniform", np.int64): c_type = "long" rng_expr = ( "shift" "+ convert_long4(gen) * (scale/two32) " "+ ((convert_long4(gen) * scale) / two32)" .replace("two32", (str(2**32)+"l"))) counter_multiplier = 2 else: raise TypeError( "unsupported RNG distribution/data type combination '%s/%s'" % rng_key) kernel_name = f"rng_gen_{self.generator_name}_{distribution}" src = """//CL// #include <{header_name}> #ifndef M_PI #ifdef M_PI_F #define M_PI M_PI_F #else #define M_PI 3.14159265359f #endif #endif typedef {output_t} output_t; typedef {output_t}4 output_vec_t; typedef {gen_name}_ctr_t ctr_t; typedef {gen_name}_key_t key_t; uint4 gen_bits(key_t *key, ctr_t *ctr) {{ union {{ ctr_t ctr_el; uint4 vec_el; }} u; u.ctr_el = {gen_name}(*ctr, *key); if (++ctr->v[0] == 0) if (++ctr->v[1] == 0) ++ctr->v[2]; return u.vec_el; }} #if {include_box_muller} output_vec_t box_muller(output_vec_t x) {{ #define BOX_MULLER(I, COMPA, COMPB) \ output_t r##I = sqrt(-2*log(x.COMPA)); \ output_t c##I; \ output_t s##I = sincos((output_t) (2*M_PI) * x.COMPB, &c##I); BOX_MULLER(0, x, y); BOX_MULLER(1, z, w); return (output_vec_t) (r0*c0, r0*s0, r1*c1, r1*s1); }} #endif #define GET_RANDOM_NUM(gen) {rng_expr} kernel void {kernel_name}( int k1, #if {key_length} > 2 int k2, int k3, #endif int c0, int c1, int c2, int c3, global output_t *output, long out_size, output_t scale, output_t shift) {{ #if {key_length} == 2 key_t k = {{{{get_global_id(0), k1}}}}; #else key_t k = {{{{get_global_id(0), k1, k2, k3}}}}; #endif ctr_t c = {{{{c0, c1, c2, c3}}}}; // output bulk unsigned long idx = get_global_id(0)*4; while (idx + 4 < out_size) {{ output_vec_t ran = GET_RANDOM_NUM(gen_bits(&k, &c)); vstore4(ran, 0, &output[idx]); idx += 4*get_global_size(0); }} // output tail output_vec_t tail_ran = GET_RANDOM_NUM(gen_bits(&k, &c)); if (idx < out_size) output[idx] = tail_ran.x; if (idx+1 < out_size) output[idx+1] = tail_ran.y; if (idx+2 < out_size) output[idx+2] = tail_ran.z; if (idx+3 < out_size) output[idx+3] = tail_ran.w; }} """.format( kernel_name=kernel_name, gen_name=self.generator_name, header_name=self.header_name, output_t=c_type, key_length=self.key_length, include_box_muller=int(distribution == "normal"), rng_expr=rng_expr ) prg = cl.Program(self.context, src).build() knl = getattr(prg, kernel_name) knl.set_scalar_arg_dtypes( [np.int32] * (self.key_length - 1 + 4) + [None, np.int64, arg_dtype, arg_dtype]) return knl, counter_multiplier, size_multiplier def _fill(self, distribution, ary, scale, shift, queue=None): """Fill *ary* with uniformly distributed random numbers in the interval *(a, b)*, endpoints excluded. 
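More generally, each value written is ``shift + scale*u``, where ``u`` is
drawn from the named *distribution*: :meth:`fill_uniform` calls this with
``scale=b-a`` and ``shift=a``, and :meth:`fill_normal` with ``scale=sigma``
and ``shift=mu``.
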
:return: a :class:`pyopencl.Event` """ if queue is None: queue = ary.queue knl, counter_multiplier, size_multiplier = \ self.get_gen_kernel(ary.dtype, distribution) args = self.key + self.counter + [ ary.data, ary.size*size_multiplier, scale, shift] n = ary.size from pyopencl.array import _splay gsize, lsize = _splay(queue.device, ary.size) evt = knl(queue, gsize, lsize, *args) ary.add_event(evt) self.counter[0] += n * counter_multiplier c1_incr, self.counter[0] = divmod(self.counter[0], self.counter_max) if c1_incr: self.counter[1] += c1_incr c2_incr, self.counter[1] = divmod(self.counter[1], self.counter_max) self.counter[2] += c2_incr return evt def fill_uniform(self, ary, a=0, b=1, queue=None): return self._fill("uniform", ary, scale=(b-a), shift=a, queue=queue) def uniform(self, *args, **kwargs): """Make a new empty array, apply :meth:`fill_uniform` to it. """ a = kwargs.pop("a", 0) b = kwargs.pop("b", 1) result = cl_array.empty(*args, **kwargs) self.fill_uniform(result, queue=result.queue, a=a, b=b) return result def fill_normal(self, ary, mu=0, sigma=1, queue=None): """Fill *ary* with normally distributed numbers with mean *mu* and standard deviation *sigma*. """ return self._fill("normal", ary, scale=sigma, shift=mu, queue=queue) def normal(self, *args, **kwargs): """Make a new empty array, apply :meth:`fill_normal` to it. """ mu = kwargs.pop("mu", 0) sigma = kwargs.pop("sigma", 1) result = cl_array.empty(*args, **kwargs) self.fill_normal(result, queue=result.queue, mu=mu, sigma=sigma) return result class PhiloxGenerator(Random123GeneratorBase): __doc__ = Random123GeneratorBase.__doc__ header_name = "pyopencl-random123/philox.cl" generator_name = "philox4x32" key_length = 2 class ThreefryGenerator(Random123GeneratorBase): __doc__ = Random123GeneratorBase.__doc__ header_name = "pyopencl-random123/threefry.cl" generator_name = "threefry4x32" key_length = 4 # }}} @first_arg_dependent_memoize def _get_generator(context): if context.devices[0].type & cl.device_type.CPU: gen = PhiloxGenerator(context) else: gen = ThreefryGenerator(context) return gen def fill_rand(result, queue=None, a=0, b=1): """Fill *result* with random values in the range :math:`[0, 1)`. """ if queue is None: queue = result.queue gen = _get_generator(queue.context) gen.fill_uniform(result, a=a, b=b) def rand(queue, shape, dtype, luxury=None, a=0, b=1): """Return an array of *shape* filled with random values of *dtype* in the range :math:`[a, b)`. 
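A usage sketch (assumes an existing :class:`pyopencl.CommandQueue` ``queue``)::

    import numpy as np
    import pyopencl.clrandom as clrandom

    x = clrandom.rand(queue, (10**6,), np.float32, a=-1, b=1)
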
""" if luxury is not None: from warnings import warn warn("Specifying the 'luxury' argument is deprecated and will stop being " "supported in PyOpenCL 2018.x", stacklevel=2) from pyopencl.array import Array gen = _get_generator(queue.context) result = Array(queue, shape, dtype) gen.fill_uniform(result, a=a, b=b) return result # vim: filetype=pyopencl:foldmethod=marker pyopencl-2025.1/pyopencl/cltypes.py0000644000000000000000000001132014332717401014247 0ustar00__copyright__ = "Copyright (C) 2016 Jonathan Mackenzie" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import warnings import numpy as np from pyopencl.tools import get_or_register_dtype if __file__.endswith("array.py"): warnings.warn( "pyopencl.array.vec is deprecated. Please use pyopencl.cltypes.", stacklevel=2) """ This file provides a type mapping from OpenCl type names to their numpy equivalents """ char = np.int8 uchar = np.uint8 short = np.int16 ushort = np.uint16 int = np.int32 uint = np.uint32 long = np.int64 ulong = np.uint64 half = np.float16 float = np.float32 double = np.float64 # {{{ vector types def _create_vector_types(): mapping = [(k, globals()[k]) for k in ["char", "uchar", "short", "ushort", "int", "uint", "long", "ulong", "float", "double"]] def set_global(key, val): globals()[key] = val vec_types = {} vec_type_to_scalar_and_count = {} field_names = ["x", "y", "z", "w"] counts = [2, 3, 4, 8, 16] for base_name, base_type in mapping: for count in counts: name = "%s%d" % (base_name, count) titles = field_names[:count] padded_count = count if count == 3: padded_count = 4 names = ["s%d" % i for i in range(count)] while len(names) < padded_count: names.append("padding%d" % (len(names) - count)) if len(titles) < len(names): titles.extend((len(names) - len(titles)) * [None]) try: dtype = np.dtype({ "names": names, "formats": [base_type] * padded_count, "titles": titles}) except NotImplementedError: try: dtype = np.dtype([((n, title), base_type) for (n, title) in zip(names, titles)]) except TypeError: dtype = np.dtype([(n, base_type) for (n, title) in zip(names, titles)]) get_or_register_dtype(name, dtype) set_global(name, dtype) def create_array(dtype, count, padded_count, *args, **kwargs): if len(args) < count: from warnings import warn warn("default values for make_xxx are deprecated;" " instead specify all parameters or use" " cltypes.zeros_xxx", DeprecationWarning, stacklevel=4) padded_args = tuple(list(args) + [0] * (padded_count - len(args))) array = eval("array(padded_args, dtype=dtype)", {"array": np.array, "padded_args": padded_args, 
"dtype": dtype}) for key, val in list(kwargs.items()): array[key] = val return array set_global("make_" + name, eval( "lambda *args, **kwargs: create_array(dtype, %i, %i, " "*args, **kwargs)" % (count, padded_count), {"create_array": create_array, "dtype": dtype})) set_global("filled_" + name, eval( "lambda val: make_%s(*[val]*%i)" % (name, count))) set_global("zeros_" + name, eval("lambda: filled_%s(0)" % (name))) set_global("ones_" + name, eval("lambda: filled_%s(1)" % (name))) vec_types[np.dtype(base_type), count] = dtype vec_type_to_scalar_and_count[dtype] = np.dtype(base_type), count return vec_types, vec_type_to_scalar_and_count vec_types, vec_type_to_scalar_and_count = _create_vector_types() # }}} # vim: foldmethod=marker pyopencl-2025.1/pyopencl/compyte/.gitignore0000644000000000000000000000025014332717401015662 0ustar00build .*.sw[po] .sw[po] *~ *.pyc *.pyo *.egg-info MANIFEST dist setuptools*egg setuptools.pth distribute*egg distribute*tar.gz *.so *.o *.aux *.bbl *.blg *.log .cache pyopencl-2025.1/pyopencl/compyte/__init__.py0000644000000000000000000000000014332717401015774 0ustar00pyopencl-2025.1/pyopencl/compyte/array.py0000644000000000000000000001637614332717401015402 0ustar00__copyright__ = "Copyright (C) 2011 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import numpy as np def f_contiguous_strides(itemsize, shape): if shape: strides = [itemsize] for s in shape[:-1]: # NOTE: max(1, s) is used to handle 0-sized axes in `shape`; # the stride for `shape[i] <= 1` doesn't matter, but letting it be 0 # is not a good idea: https://github.com/inducer/arraycontext/pull/91 strides.append(strides[-1]*max(1, s)) return tuple(strides) else: return () def c_contiguous_strides(itemsize, shape): if shape: strides = [itemsize] for s in shape[:0:-1]: # NOTE: max(1, s) is used to handle 0-sized axes in `shape`; # the stride for `shape[i] <= 1` doesn't matter, but letting it be 0 # is not a good idea: https://github.com/inducer/arraycontext/pull/91 strides.append(strides[-1]*max(1, s)) return tuple(strides[::-1]) else: return () def equal_strides(strides1, strides2, shape): if strides1 == strides2: return True if len(strides1) != len(strides2) or len(strides2) != len(shape): return False for s, st1, st2 in zip(shape, strides1, strides2): if s != 1 and st1 != st2: return False return True def is_f_contiguous_strides(strides, itemsize, shape): from pytools import product return ( equal_strides(strides, f_contiguous_strides(itemsize, shape), shape) or product(shape) == 0) # noqa: W503 def is_c_contiguous_strides(strides, itemsize, shape): from pytools import product return (equal_strides(strides, c_contiguous_strides(itemsize, shape), shape) or product(shape) == 0) # noqa: W503 class ArrayFlags: def __init__(self, ary): self.f_contiguous = is_f_contiguous_strides( ary.strides, ary.dtype.itemsize, ary.shape) self.c_contiguous = is_c_contiguous_strides( ary.strides, ary.dtype.itemsize, ary.shape) self.forc = self.f_contiguous or self.c_contiguous def __repr__(self): return ( f" C_CONTIGUOUS : {self.c_contiguous}\n" f" F_CONTIGUOUS : {self.f_contiguous}" ) def __str__(self): return repr(self) def get_common_dtype(obj1, obj2, allow_double): # Yes, numpy behaves differently depending on whether # we're dealing with arrays or scalars. zero1 = np.zeros(1, dtype=obj1.dtype) try: zero2 = np.zeros(1, dtype=obj2.dtype) except AttributeError: zero2 = obj2 result = (zero1 + zero2).dtype if not allow_double: if result == np.float64: result = np.dtype(np.float32) elif result == np.complex128: result = np.dtype(np.complex64) return result def bound(a): high = a.bytes low = a.bytes for stri, shp in zip(a.strides, a.shape): if stri < 0: low += (stri)*(shp-1) else: high += (stri)*(shp-1) return low, high def may_share_memory(a, b): # When this is called with a an ndarray and b # a sparse matrix, numpy.may_share_memory fails. if a is b: return True if a.__class__ is b.__class__: a_l, a_h = bound(a) b_l, b_h = bound(b) if b_l >= a_h or a_l >= b_h: return False return True else: return False # {{{ as_strided implementation try: from numpy.lib.stride_tricks import as_strided as _as_strided _test_dtype = np.dtype( [("a", np.float64), ("b", np.float64)], align=True) _test_result = _as_strided(np.zeros(10, dtype=_test_dtype)) if _test_result.dtype != _test_dtype: raise RuntimeError("numpy's as_strided is broken") as_strided = _as_strided except Exception: # stolen from numpy to be compatible with older versions of numpy class _DummyArray: """ Dummy object that just exists to hang __array_interface__ dictionaries and possibly keep alive a reference to a base array. """ def __init__(self, interface, base=None): self.__array_interface__ = interface self.base = base def as_strided(x, shape=None, strides=None): """ Make an ndarray from the given array with the given shape and strides. 
""" # work around Numpy bug 1873 (reported by Irwin Zaid) # Since this is stolen from numpy, this implementation has the same bug. # http://projects.scipy.org/numpy/ticket/1873 # == https://github.com/numpy/numpy/issues/2466 # Do not recreate the array if nothing need to be changed. # This fixes a lot of errors on pypy since DummyArray hack does not # currently (2014/May/17) on pypy. if ((shape is None or x.shape == shape) and (strides is None or x.strides == strides)): # noqa: W503 return x if not x.dtype.isbuiltin: if shape is None: shape = x.shape strides = tuple(strides) from pytools import product if strides is not None and shape is not None \ and product(shape) == product(x.shape) \ and x.flags.forc: # Workaround: If we're being asked to do what amounts to a # contiguous reshape, at least do that. if strides == f_contiguous_strides(x.dtype.itemsize, shape): result = x.reshape(-1).reshape(*shape, order="F") assert result.strides == strides return result elif strides == c_contiguous_strides(x.dtype.itemsize, shape): result = x.reshape(-1).reshape(*shape, order="C") assert result.strides == strides return result raise NotImplementedError( "as_strided won't work on non-builtin arrays for now. " "See https://github.com/numpy/numpy/issues/2466") interface = dict(x.__array_interface__) if shape is not None: interface["shape"] = tuple(shape) if strides is not None: interface["strides"] = tuple(strides) return np.asarray(_DummyArray(interface, base=x)) # }}} pyopencl-2025.1/pyopencl/compyte/dtypes.py0000644000000000000000000002316314332717401015564 0ustar00"""Type mapping helpers.""" __copyright__ = "Copyright (C) 2011 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import numpy as np class TypeNameNotKnown(RuntimeError): # noqa: N818 pass # {{{ registry class DTypeRegistry: def __init__(self): self.dtype_to_name = {} self.name_to_dtype = {} def get_or_register_dtype(self, c_names, dtype=None): """Get or register a :class:`numpy.dtype` associated with the C type names in the string list *c_names*. If *dtype* is `None`, no registration is performed, and the :class:`numpy.dtype` must already have been registered. If so, it is returned. If not, :exc:`TypeNameNotKnown` is raised. If *dtype* is not `None`, registration is attempted. If the *c_names* are already known and registered to identical :class:`numpy.dtype` objects, then the previously dtype object of the previously registered type is returned. If the *c_names* are not yet known, the type is registered. 
If one of the *c_names* is known but registered to a different type, an error is raised. In this latter case, the type may end up partially registered and any further behavior is undefined. .. versionadded:: 2012.2 """ if isinstance(c_names, str): c_names = [c_names] if dtype is None: from pytools import single_valued return single_valued(self.name_to_dtype[name] for name in c_names) dtype = np.dtype(dtype) # check if we've seen an identical dtype, if so retrieve exact dtype object. try: existing_name = self.dtype_to_name[dtype] except KeyError: existed = False else: existed = True existing_dtype = self.name_to_dtype[existing_name] assert existing_dtype == dtype dtype = existing_dtype for nm in c_names: try: name_dtype = self.name_to_dtype[nm] except KeyError: self.name_to_dtype[nm] = dtype else: if name_dtype != dtype: raise RuntimeError("name '%s' already registered to " "different dtype" % nm) if not existed: self.dtype_to_name[dtype] = c_names[0] if str(dtype) not in self.dtype_to_name: self.dtype_to_name[str(dtype)] = c_names[0] return dtype def dtype_to_ctype(self, dtype): if dtype is None: raise ValueError("dtype may not be None") dtype = np.dtype(dtype) try: return self.dtype_to_name[dtype] except KeyError: raise ValueError("unable to map dtype '%s'" % dtype) from None # }}} # {{{ C types def fill_registry_with_c_types(reg, respect_windows, include_bool=True): import struct from sys import platform if include_bool: # bool is of unspecified size in the OpenCL spec and may in fact be # 4-byte. reg.get_or_register_dtype("bool", np.bool_) reg.get_or_register_dtype(["signed char", "char"], np.int8) reg.get_or_register_dtype("unsigned char", np.uint8) reg.get_or_register_dtype(["short", "signed short", "signed short int", "short signed int"], np.int16) reg.get_or_register_dtype(["unsigned short", "unsigned short int", "short unsigned int"], np.uint16) reg.get_or_register_dtype(["int", "signed int"], np.int32) reg.get_or_register_dtype(["unsigned", "unsigned int"], np.uint32) is_64_bit = struct.calcsize("@P") * 8 == 64 if is_64_bit: if "win32" in platform and respect_windows: i64_name = "long long" else: i64_name = "long" reg.get_or_register_dtype( [i64_name, "%s int" % i64_name, "signed %s int" % i64_name, "%s signed int" % i64_name], np.int64) reg.get_or_register_dtype( ["unsigned %s" % i64_name, "unsigned %s int" % i64_name, "%s unsigned int" % i64_name], np.uint64) # http://projects.scipy.org/numpy/ticket/2017 if is_64_bit: reg.get_or_register_dtype(["unsigned %s" % i64_name], np.uintp) else: reg.get_or_register_dtype(["unsigned"], np.uintp) reg.get_or_register_dtype("float", np.float32) reg.get_or_register_dtype("double", np.float64) def fill_registry_with_opencl_c_types(reg): reg.get_or_register_dtype(["char", "signed char"], np.int8) reg.get_or_register_dtype(["uchar", "unsigned char"], np.uint8) reg.get_or_register_dtype(["short", "signed short", "signed short int", "short signed int"], np.int16) reg.get_or_register_dtype(["ushort", "unsigned short", "unsigned short int", "short unsigned int"], np.uint16) reg.get_or_register_dtype(["int", "signed int"], np.int32) reg.get_or_register_dtype(["uint", "unsigned", "unsigned int"], np.uint32) reg.get_or_register_dtype( ["long", "long int", "signed long int", "long signed int"], np.int64) reg.get_or_register_dtype( ["ulong", "unsigned long", "unsigned long int", "long unsigned int"], np.uint64) reg.get_or_register_dtype(["intptr_t"], np.intp) reg.get_or_register_dtype(["uintptr_t"], np.uintp) reg.get_or_register_dtype("float", 
np.float32) reg.get_or_register_dtype("double", np.float64) def fill_registry_with_c99_stdint_types(reg): reg.get_or_register_dtype("bool", np.bool_) reg.get_or_register_dtype("int8_t", np.int8) reg.get_or_register_dtype("uint8_t", np.uint8) reg.get_or_register_dtype("int16_t", np.int16) reg.get_or_register_dtype("uint16_t", np.uint16) reg.get_or_register_dtype("int32_t", np.int32) reg.get_or_register_dtype("uint32_t", np.uint32) reg.get_or_register_dtype("int64_t", np.int64) reg.get_or_register_dtype("uint64_t", np.uint64) reg.get_or_register_dtype("uintptr_t", np.uintp) reg.get_or_register_dtype("float", np.float32) reg.get_or_register_dtype("double", np.float64) def fill_registry_with_c99_complex_types(reg): reg.get_or_register_dtype("float complex", np.complex64) reg.get_or_register_dtype("double complex", np.complex128) reg.get_or_register_dtype("long double complex", np.clongdouble) # }}} # {{{ backward compatibility TYPE_REGISTRY = DTypeRegistry() # These are deprecated and should no longer be used DTYPE_TO_NAME = TYPE_REGISTRY.dtype_to_name NAME_TO_DTYPE = TYPE_REGISTRY.name_to_dtype dtype_to_ctype = TYPE_REGISTRY.dtype_to_ctype get_or_register_dtype = TYPE_REGISTRY.get_or_register_dtype def _fill_dtype_registry(respect_windows, include_bool=True): fill_registry_with_c_types( TYPE_REGISTRY, respect_windows, include_bool) # }}} # {{{ c declarator parsing def parse_c_arg_backend(c_arg, scalar_arg_factory, vec_arg_factory, name_to_dtype=None): if isinstance(name_to_dtype, DTypeRegistry): name_to_dtype = name_to_dtype.name_to_dtype__getitem__ elif name_to_dtype is None: name_to_dtype = NAME_TO_DTYPE.__getitem__ c_arg = (c_arg .replace("const", "") .replace("volatile", "") .replace("__restrict__", "") .replace("restrict", "")) # process and remove declarator import re decl_re = re.compile(r"(\**)\s*([_a-zA-Z0-9]+)(\s*\[[ 0-9]*\])*\s*$") decl_match = decl_re.search(c_arg) if decl_match is None: raise ValueError("couldn't parse C declarator '%s'" % c_arg) name = decl_match.group(2) if decl_match.group(1) or decl_match.group(3) is not None: arg_class = vec_arg_factory else: arg_class = scalar_arg_factory tp = c_arg[:decl_match.start()] tp = " ".join(tp.split()) try: dtype = name_to_dtype(tp) except KeyError: raise ValueError("unknown type '%s'" % tp) from None return arg_class(dtype, name) # }}} def register_dtype(dtype, c_names, alias_ok=False): from warnings import warn warn("register_dtype is deprecated. Use get_or_register_dtype instead.", DeprecationWarning, stacklevel=2) if isinstance(c_names, str): c_names = [c_names] dtype = np.dtype(dtype) # check if we've seen this dtype before and error out if a) it was seen before # and b) alias_ok is False. 
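    # (With alias_ok=True, the names are instead handed to
    # get_or_register_dtype below, which adds them as aliases for the
    # already-registered dtype.)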
if not alias_ok and dtype in TYPE_REGISTRY.dtype_to_name: raise RuntimeError("dtype '%s' already registered (as '%s', new names '%s')" % (dtype, TYPE_REGISTRY.dtype_to_name[dtype], ", ".join(c_names))) TYPE_REGISTRY.get_or_register_dtype(c_names, dtype) # vim: foldmethod=marker pyopencl-2025.1/pyopencl/compyte/pyproject.toml0000644000000000000000000000230214332717401016606 0ustar00[tool.ruff] preview = true [tool.ruff.lint] extend-select = [ "B", # flake8-bugbear "C", # flake8-comprehensions "E", # pycodestyle "F", # pyflakes "I", # flake8-isort "N", # pep8-naming "NPY", # numpy "Q", # flake8-quotes "W", # pycodestyle # TODO # "UP", # pyupgrade # "RUF", # ruff ] extend-ignore = [ "C90", # McCabe complexity "E221", # multiple spaces before operator "E241", # multiple spaces after comma "E402", # module level import not at the top of file "E226", # missing whitespace around operator "N817", # CamelCase `SubstitutionRuleMappingContext` imported as acronym `SRMC` # FIXME "NPY002", # numpy rng "C408", # unnecssary dict() -> literal "E265", # block comment should start with "F841", # local variable unused ] [tool.ruff.lint.per-file-ignores] "ndarray/**/*.py" = ["Q", "B", "E", "F", "N", "C4"] [tool.ruff.lint.flake8-quotes] docstring-quotes = "double" inline-quotes = "double" multiline-quotes = "double" [tool.ruff.lint.isort] combine-as-imports = true known-first-party = [ "pytools", "pymbolic", ] known-local-folder = [ "modepy", ] lines-after-imports = 2 pyopencl-2025.1/pyopencl/elementwise.py0000644000000000000000000011334114332717401015113 0ustar00"""Elementwise functionality.""" __copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import enum from typing import Any, List, Optional, Tuple, Union import numpy as np from pytools import memoize_method import pyopencl as cl from pyopencl.tools import ( DtypedArgument, KernelTemplateBase, ScalarArg, VectorArg, context_dependent_memoize, dtype_to_c_struct, dtype_to_ctype, ) # {{{ elementwise kernel code generator def get_elwise_program( context: cl.Context, arguments: List[DtypedArgument], operation: str, *, name: str = "elwise_kernel", options: Any = None, preamble: str = "", loop_prep: str = "", after_loop: str = "", use_range: bool = False) -> cl.Program: if use_range: body = r"""//CL// if (step < 0) { for (i = start + (work_group_start + lid)*step; i > stop; i += gsize*step) { %(operation)s; } } else { for (i = start + (work_group_start + lid)*step; i < stop; i += gsize*step) { %(operation)s; } } """ else: body = """//CL// for (i = work_group_start + lid; i < n; i += gsize) { %(operation)s; } """ import re return_match = re.search(r"\breturn\b", operation) if return_match is not None: from warnings import warn warn("Using a 'return' statement in an element-wise operation will " "likely lead to incorrect results. Use " "PYOPENCL_ELWISE_CONTINUE instead.", stacklevel=3) source = (f"""//CL// {preamble} #define PYOPENCL_ELWISE_CONTINUE continue __kernel void {name}({", ".join(arg.declarator() for arg in arguments)}) {{ int lid = get_local_id(0); int gsize = get_global_size(0); int work_group_start = get_local_size(0)*get_group_id(0); long i; {loop_prep}; {body % {"operation": operation}} {after_loop}; }} """) return cl.Program(context, source).build(options) def get_elwise_kernel_and_types( context: cl.Context, arguments: Union[str, List[DtypedArgument]], operation: str, *, name: str = "elwise_kernel", options: Any = None, preamble: str = "", use_range: bool = False, **kwargs: Any) -> Tuple[cl.Kernel, List[DtypedArgument]]: from pyopencl.tools import get_arg_offset_adjuster_code, parse_arg_list parsed_args = parse_arg_list(arguments, with_offset=True) auto_preamble = kwargs.pop("auto_preamble", True) pragmas = [] includes = [] have_double_pragma = False have_complex_include = False if auto_preamble: for arg in parsed_args: if arg.dtype in [np.float64, np.complex128]: if not have_double_pragma: pragmas.append(""" #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE """) have_double_pragma = True if arg.dtype.kind == "c": if not have_complex_include: includes.append("#include \n") have_complex_include = True if pragmas or includes: preamble = "\n".join(pragmas+includes) + "\n" + preamble if use_range: parsed_args.extend([ ScalarArg(np.intp, "start"), ScalarArg(np.intp, "stop"), ScalarArg(np.intp, "step"), ]) else: parsed_args.append(ScalarArg(np.intp, "n")) loop_prep = kwargs.pop("loop_prep", "") loop_prep = get_arg_offset_adjuster_code(parsed_args) + loop_prep prg = get_elwise_program( context, parsed_args, operation, name=name, options=options, preamble=preamble, use_range=use_range, loop_prep=loop_prep, **kwargs) from pyopencl.tools import get_arg_list_arg_types kernel = getattr(prg, name) kernel.set_scalar_arg_dtypes(get_arg_list_arg_types(parsed_args)) return kernel, parsed_args def get_elwise_kernel( context: cl.Context, arguments: Union[str, List[DtypedArgument]], operation: str, *, name: str = "elwise_kernel", options: Any = None, **kwargs: Any) -> cl.Kernel: """Return a L{pyopencl.Kernel} that performs the same scalar operation on one or several vectors. 
""" func, arguments = get_elwise_kernel_and_types( context, arguments, operation, name=name, options=options, **kwargs) return func # }}} # {{{ ElementwiseKernel driver class ElementwiseKernel: """ A kernel that takes a number of scalar or vector *arguments* and performs an *operation* specified as a snippet of C on these arguments. :arg arguments: a string formatted as a C argument list. :arg operation: a snippet of C that carries out the desired 'map' operation. The current index is available as the variable *i*. *operation* may contain the statement ``PYOPENCL_ELWISE_CONTINUE``, which will terminate processing for the current element. :arg name: the function name as which the kernel is compiled :arg options: passed unmodified to :meth:`pyopencl.Program.build`. :arg preamble: a piece of C source code that gets inserted outside of the function context in the elementwise operation's kernel source code. .. warning :: Using a ``return`` statement in *operation* will lead to incorrect results, as some elements may never get processed. Use ``PYOPENCL_ELWISE_CONTINUE`` instead. .. versionchanged:: 2013.1 Added ``PYOPENCL_ELWISE_CONTINUE``. .. automethod:: __call__ """ def __init__( self, context: cl.Context, arguments: Union[str, List[DtypedArgument]], operation: str, name: str = "elwise_kernel", options: Any = None, **kwargs: Any) -> None: self.context = context self.arguments = arguments self.operation = operation self.name = name self.options = options self.kwargs = kwargs @memoize_method def get_kernel(self, use_range: bool): knl, arg_descrs = get_elwise_kernel_and_types( self.context, self.arguments, self.operation, name=self.name, options=self.options, use_range=use_range, **self.kwargs) for arg in arg_descrs: if isinstance(arg, VectorArg) and not arg.with_offset: from warnings import warn warn( f"ElementwiseKernel '{self.name}' used with VectorArgs " "that do not have offset support enabled. This usage is " "deprecated. Just pass with_offset=True to VectorArg, " "everything should sort itself out automatically.", DeprecationWarning, stacklevel=2) if not any(isinstance(arg, VectorArg) for arg in arg_descrs): raise RuntimeError( "ElementwiseKernel can only be used with functions that have " "at least one vector argument") return knl, arg_descrs def __call__(self, *args, **kwargs) -> cl.Event: """ Invoke the generated scalar kernel. The arguments may either be scalars or :class:`pyopencl.array.Array` instances. |std-enqueue-blurb| """ range_ = kwargs.pop("range", None) slice_ = kwargs.pop("slice", None) capture_as = kwargs.pop("capture_as", None) queue = kwargs.pop("queue", None) wait_for = kwargs.pop("wait_for", None) if kwargs: raise TypeError(f"unknown keyword arguments: '{', '.join(kwargs)}'") use_range = range_ is not None or slice_ is not None kernel, arg_descrs = self.get_kernel(use_range) if wait_for is None: wait_for = [] else: # We'll be modifying it below. 
wait_for = list(wait_for) # {{{ assemble arg array repr_vec = None invocation_args = [] for arg, arg_descr in zip(args, arg_descrs): if isinstance(arg_descr, VectorArg): if repr_vec is None: repr_vec = arg invocation_args.append(arg) else: invocation_args.append(arg) assert repr_vec is not None # }}} if queue is None: queue = repr_vec.queue if slice_ is not None: if range_ is not None: raise TypeError( "may not specify both range and slice keyword arguments") range_ = slice(*slice_.indices(repr_vec.size)) max_wg_size = kernel.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, queue.device) if range_ is not None: start = range_.start if start is None: start = 0 invocation_args.append(start) invocation_args.append(range_.stop) if range_.step is None: step = 1 else: step = range_.step invocation_args.append(step) from pyopencl.array import _splay gs, ls = _splay(queue.device, abs(range_.stop - start)//step, max_wg_size) else: invocation_args.append(repr_vec.size) gs, ls = repr_vec._get_sizes(queue, max_wg_size) if capture_as is not None: kernel.set_args(*invocation_args) kernel.capture_call( capture_as, queue, gs, ls, *invocation_args, wait_for=wait_for) return kernel(queue, gs, ls, *invocation_args, wait_for=wait_for) # }}} # {{{ template class ElementwiseTemplate(KernelTemplateBase): def __init__( self, arguments: Union[str, List[DtypedArgument]], operation: str, name: str = "elwise", preamble: str = "", template_processor: Optional[str] = None) -> None: super().__init__(template_processor=template_processor) self.arguments = arguments self.operation = operation self.name = name self.preamble = preamble def build_inner(self, context, type_aliases=(), var_values=(), more_preamble="", more_arguments=(), declare_types=(), options=None): renderer = self.get_renderer( type_aliases, var_values, context, options) arg_list = renderer.render_argument_list( self.arguments, more_arguments, with_offset=True) type_decl_preamble = renderer.get_type_decl_preamble( context.devices[0], declare_types, arg_list) return ElementwiseKernel(context, arg_list, renderer(self.operation), name=renderer(self.name), options=options, preamble=( type_decl_preamble + "\n" + renderer(self.preamble + "\n" + more_preamble)), auto_preamble=False) # }}} # {{{ argument kinds class ArgumentKind(enum.Enum): ARRAY = enum.auto() DEV_SCALAR = enum.auto() SCALAR = enum.auto() def get_argument_kind(v: Any) -> ArgumentKind: from pyopencl.array import Array if isinstance(v, Array): if v.shape == (): return ArgumentKind.DEV_SCALAR else: return ArgumentKind.ARRAY else: return ArgumentKind.SCALAR def get_decl_and_access_for_kind(name: str, kind: ArgumentKind) -> Tuple[str, str]: if kind == ArgumentKind.ARRAY: return f"*{name}", f"{name}[i]" elif kind == ArgumentKind.SCALAR: return f"{name}", name elif kind == ArgumentKind.DEV_SCALAR: return f"*{name}", f"{name}[0]" else: raise AssertionError() # }}} # {{{ kernels supporting array functionality @context_dependent_memoize def get_take_kernel(context, dtype, idx_dtype, vec_count=1): idx_tp = dtype_to_ctype(idx_dtype) args = ([VectorArg(dtype, f"dest{i}", with_offset=True) for i in range(vec_count)] + [VectorArg(dtype, f"src{i}", with_offset=True) for i in range(vec_count)] + [VectorArg(idx_dtype, "idx", with_offset=True)]) body = ( f"{idx_tp} src_idx = idx[i];\n" + "\n".join( f"dest{i}[i] = src{i}[src_idx];" for i in range(vec_count)) ) return get_elwise_kernel(context, args, body, preamble=dtype_to_c_struct(context.devices[0], dtype), name="take") @context_dependent_memoize def 
get_take_put_kernel(context, dtype, idx_dtype, with_offsets, vec_count=1): idx_tp = dtype_to_ctype(idx_dtype) args = [ VectorArg(dtype, f"dest{i}") for i in range(vec_count) ] + [ VectorArg(idx_dtype, "gmem_dest_idx", with_offset=True), VectorArg(idx_dtype, "gmem_src_idx", with_offset=True), ] + [ VectorArg(dtype, f"src{i}", with_offset=True) for i in range(vec_count) ] + [ ScalarArg(idx_dtype, f"offset{i}") for i in range(vec_count) if with_offsets ] if with_offsets: def get_copy_insn(i): return f"dest{i}[dest_idx] = src{i}[src_idx + offset{i}];" else: def get_copy_insn(i): return f"dest{i}[dest_idx] = src{i}[src_idx];" body = ((f"{idx_tp} src_idx = gmem_src_idx[i];\n" f"{idx_tp} dest_idx = gmem_dest_idx[i];\n") + "\n".join(get_copy_insn(i) for i in range(vec_count))) return get_elwise_kernel(context, args, body, preamble=dtype_to_c_struct(context.devices[0], dtype), name="take_put") @context_dependent_memoize def get_put_kernel(context, dtype, idx_dtype, vec_count=1): idx_tp = dtype_to_ctype(idx_dtype) args = [ VectorArg(dtype, f"dest{i}", with_offset=True) for i in range(vec_count) ] + [ VectorArg(idx_dtype, "gmem_dest_idx", with_offset=True), ] + [ VectorArg(dtype, f"src{i}", with_offset=True) for i in range(vec_count) ] + [ VectorArg(np.uint8, "use_fill", with_offset=True) ] + [ VectorArg(np.int64, "val_ary_lengths", with_offset=True) ] body = ( f"{idx_tp} dest_idx = gmem_dest_idx[i];\n" + "\n".join( f"dest{i}[dest_idx] = (use_fill[{i}] ? src{i}[0] : " f"src{i}[i % val_ary_lengths[{i}]]);" for i in range(vec_count) ) ) return get_elwise_kernel(context, args, body, preamble=dtype_to_c_struct(context.devices[0], dtype), name="put") @context_dependent_memoize def get_copy_kernel(context, dtype_dest, dtype_src): src = "src[i]" if dtype_dest.kind == "c" != dtype_src.kind: name = complex_dtype_to_name(dtype_dest) src = f"{name}_fromreal({src})" if dtype_dest.kind == "c" and dtype_src != dtype_dest: name = complex_dtype_to_name(dtype_dest) src = f"{name}_cast({src})" if dtype_dest != dtype_src and ( dtype_dest.kind == "V" or dtype_src.kind == "V"): raise TypeError("copying between non-identical struct types") return get_elwise_kernel(context, "{tp_dest} *dest, {tp_src} *src".format( tp_dest=dtype_to_ctype(dtype_dest), tp_src=dtype_to_ctype(dtype_src), ), f"dest[i] = {src}", preamble=dtype_to_c_struct(context.devices[0], dtype_dest), name="copy") def complex_dtype_to_name(dtype) -> str: if dtype == np.complex128: return "cdouble" elif dtype == np.complex64: return "cfloat" else: raise RuntimeError(f"invalid complex type: {dtype}") def real_dtype(dtype): return dtype.type(0).real.dtype @context_dependent_memoize def get_axpbyz_kernel(context, dtype_x, dtype_y, dtype_z, x_is_scalar=False, y_is_scalar=False): result_t = dtype_to_ctype(dtype_z) x_is_complex = dtype_x.kind == "c" y_is_complex = dtype_y.kind == "c" x = "x[0]" if x_is_scalar else "x[i]" y = "y[0]" if y_is_scalar else "y[i]" if dtype_z.kind == "c": # a and b will always be complex here. 
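        # (In the kernel signature below, a and b are declared with the
        # result type tp_z, so a complex dtype_z implies complex a and b.)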
z_ct = complex_dtype_to_name(dtype_z) if x_is_complex: ax = f"{z_ct}_mul(a, {z_ct}_cast({x}))" else: ax = f"{z_ct}_mulr(a, {x})" if y_is_complex: by = f"{z_ct}_mul(b, {z_ct}_cast({y}))" else: by = f"{z_ct}_mulr(b, {y})" result = f"{z_ct}_add({ax}, {by})" else: # real-only ax = f"a*(({result_t}) {x})" by = f"b*(({result_t}) {y})" result = f"{ax} + {by}" return get_elwise_kernel(context, "{tp_z} *z, {tp_z} a, {tp_x} *x, {tp_z} b, {tp_y} *y".format( tp_x=dtype_to_ctype(dtype_x), tp_y=dtype_to_ctype(dtype_y), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {result}", name="axpbyz") @context_dependent_memoize def get_axpbz_kernel(context, dtype_a, dtype_x, dtype_b, dtype_z): a_is_complex = dtype_a.kind == "c" x_is_complex = dtype_x.kind == "c" b_is_complex = dtype_b.kind == "c" z_is_complex = dtype_z.kind == "c" ax = "a*x[i]" if x_is_complex: a = "a" x = "x[i]" if dtype_x != dtype_z: x = "{}_cast({})".format(complex_dtype_to_name(dtype_z), x) if a_is_complex: if dtype_a != dtype_z: a = "{}_cast({})".format(complex_dtype_to_name(dtype_z), a) ax = "{}_mul({}, {})".format(complex_dtype_to_name(dtype_z), a, x) else: ax = "{}_rmul({}, {})".format(complex_dtype_to_name(dtype_z), a, x) elif a_is_complex: a = "a" x = "x[i]" if dtype_a != dtype_z: a = "{}_cast({})".format(complex_dtype_to_name(dtype_z), a) ax = "{}_mulr({}, {})".format(complex_dtype_to_name(dtype_z), a, x) b = "b" if z_is_complex and not b_is_complex: b = "{}_fromreal({})".format(complex_dtype_to_name(dtype_z), b) if z_is_complex and not (a_is_complex or x_is_complex): ax = "{}_fromreal({})".format(complex_dtype_to_name(dtype_z), ax) if z_is_complex: ax = "{}_cast({})".format(complex_dtype_to_name(dtype_z), ax) b = "{}_cast({})".format(complex_dtype_to_name(dtype_z), b) if a_is_complex or x_is_complex or b_is_complex: expr = "{root}_add({ax}, {b})".format( ax=ax, b=b, root=complex_dtype_to_name(dtype_z)) else: expr = f"{ax} + {b}" return get_elwise_kernel(context, "{tp_z} *z, {tp_a} a, {tp_x} *x,{tp_b} b".format( tp_a=dtype_to_ctype(dtype_a), tp_x=dtype_to_ctype(dtype_x), tp_b=dtype_to_ctype(dtype_b), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {expr}", name="axpb") @context_dependent_memoize def get_multiply_kernel(context, dtype_x, dtype_y, dtype_z, x_is_scalar=False, y_is_scalar=False): x_is_complex = dtype_x.kind == "c" y_is_complex = dtype_y.kind == "c" x = "x[0]" if x_is_scalar else "x[i]" y = "y[0]" if y_is_scalar else "y[i]" if x_is_complex and dtype_x != dtype_z: x = "{}_cast({})".format(complex_dtype_to_name(dtype_z), x) if y_is_complex and dtype_y != dtype_z: y = "{}_cast({})".format(complex_dtype_to_name(dtype_z), y) if x_is_complex and y_is_complex: xy = "{}_mul({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif x_is_complex and not y_is_complex: xy = "{}_mulr({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif not x_is_complex and y_is_complex: xy = "{}_rmul({}, {})".format(complex_dtype_to_name(dtype_z), x, y) else: xy = f"{x} * {y}" return get_elwise_kernel(context, "{tp_z} *z, {tp_x} *x, {tp_y} *y".format( tp_x=dtype_to_ctype(dtype_x), tp_y=dtype_to_ctype(dtype_y), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {xy}", name="multiply") @context_dependent_memoize def get_divide_kernel(context, dtype_x, dtype_y, dtype_z, x_is_scalar=False, y_is_scalar=False): x_is_complex = dtype_x.kind == "c" y_is_complex = dtype_y.kind == "c" z_is_complex = dtype_z.kind == "c" x = "x[0]" if x_is_scalar else "x[i]" y = "y[0]" if y_is_scalar else "y[i]" if z_is_complex and dtype_x != dtype_y: if x_is_complex and dtype_x != dtype_z: x = 
"{}_cast({})".format(complex_dtype_to_name(dtype_z), x) if y_is_complex and dtype_y != dtype_z: y = "{}_cast({})".format(complex_dtype_to_name(dtype_z), y) else: if dtype_x != dtype_z: x = f"({dtype_to_ctype(dtype_z)}) ({x})" if dtype_y != dtype_z: y = f"({dtype_to_ctype(dtype_z)}) ({y})" if x_is_complex and y_is_complex: xoy = "{}_divide({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif not x_is_complex and y_is_complex: xoy = "{}_rdivide({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif x_is_complex and not y_is_complex: xoy = "{}_divider({}, {})".format(complex_dtype_to_name(dtype_z), x, y) else: xoy = f"{x} / {y}" if z_is_complex: xoy = "{}_cast({})".format(complex_dtype_to_name(dtype_z), xoy) return get_elwise_kernel(context, "{tp_z} *z, {tp_x} *x, {tp_y} *y".format( tp_x=dtype_to_ctype(dtype_x), tp_y=dtype_to_ctype(dtype_y), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {xoy}", name="divide") @context_dependent_memoize def get_rdivide_elwise_kernel(context, dtype_x, dtype_y, dtype_z): # implements y / x! x_is_complex = dtype_x.kind == "c" y_is_complex = dtype_y.kind == "c" z_is_complex = dtype_z.kind == "c" x = "x[i]" y = "y" if z_is_complex and dtype_x != dtype_y: if x_is_complex and dtype_x != dtype_z: x = "{}_cast({})".format(complex_dtype_to_name(dtype_z), x) if y_is_complex and dtype_y != dtype_z: y = "{}_cast({})".format(complex_dtype_to_name(dtype_z), y) if x_is_complex and y_is_complex: yox = "{}_divide({}, {})".format(complex_dtype_to_name(dtype_z), y, x) elif not y_is_complex and x_is_complex: yox = "{}_rdivide({}, {})".format(complex_dtype_to_name(dtype_z), y, x) elif y_is_complex and not x_is_complex: yox = "{}_divider({}, {})".format(complex_dtype_to_name(dtype_z), y, x) else: yox = f"{y} / {x}" return get_elwise_kernel(context, "{tp_z} *z, {tp_x} *x, {tp_y} y".format( tp_x=dtype_to_ctype(dtype_x), tp_y=dtype_to_ctype(dtype_y), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {yox}", name="divide_r") @context_dependent_memoize def get_fill_kernel(context, dtype): return get_elwise_kernel(context, "{tp} *z, {tp} a".format(tp=dtype_to_ctype(dtype)), "z[i] = a", preamble=dtype_to_c_struct(context.devices[0], dtype), name="fill") @context_dependent_memoize def get_reverse_kernel(context, dtype): return get_elwise_kernel(context, "{tp} *z, {tp} *y".format(tp=dtype_to_ctype(dtype)), "z[i] = y[n-1-i]", name="reverse") @context_dependent_memoize def get_arange_kernel(context, dtype): if dtype.kind == "c": expr = ( "{root}_add(start, {root}_rmul(i, step))" .format(root=complex_dtype_to_name(dtype))) else: expr = f"start + (({dtype_to_ctype(dtype)}) i) * step" return get_elwise_kernel(context, [ VectorArg(dtype, "z", with_offset=True), ScalarArg(dtype, "start"), ScalarArg(dtype, "step"), ], f"z[i] = {expr}", name="arange") @context_dependent_memoize def get_pow_kernel(context, dtype_x, dtype_y, dtype_z, is_base_array, is_exp_array): if is_base_array: x = "x[i]" x_ctype = "{tp_x} *x" else: x = "x" x_ctype = "{tp_x} x" if is_exp_array: y = "y[i]" y_ctype = "{tp_y} *y" else: y = "y" y_ctype = "{tp_y} y" x_is_complex = dtype_x.kind == "c" y_is_complex = dtype_y.kind == "c" z_is_complex = dtype_z.kind == "c" if z_is_complex and dtype_x != dtype_y: if x_is_complex and dtype_x != dtype_z: x = "{}_cast({})".format(complex_dtype_to_name(dtype_z), x) if y_is_complex and dtype_y != dtype_z: y = "{}_cast({})".format(complex_dtype_to_name(dtype_z), y) elif dtype_x != dtype_y: if dtype_x != dtype_z: x = "({}) ({})".format(dtype_to_ctype(dtype_z), x) if dtype_y != dtype_z: y = "({}) 
({})".format(dtype_to_ctype(dtype_z), y) if x_is_complex and y_is_complex: result = "{}_pow({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif x_is_complex and not y_is_complex: result = "{}_powr({}, {})".format(complex_dtype_to_name(dtype_z), x, y) elif not x_is_complex and y_is_complex: result = "{}_rpow({}, {})".format(complex_dtype_to_name(dtype_z), x, y) else: result = f"pow({x}, {y})" return get_elwise_kernel(context, ("{tp_z} *z, " + x_ctype + ", " + y_ctype).format( tp_x=dtype_to_ctype(dtype_x), tp_y=dtype_to_ctype(dtype_y), tp_z=dtype_to_ctype(dtype_z), ), f"z[i] = {result}", name="pow_method") @context_dependent_memoize def get_unop_kernel(context, operator, res_dtype, in_dtype): return get_elwise_kernel(context, [ VectorArg(res_dtype, "z", with_offset=True), VectorArg(in_dtype, "y", with_offset=True), ], f"z[i] = {operator} y[i]", name="unary_op_kernel") @context_dependent_memoize def get_array_scalar_binop_kernel(context, operator, dtype_res, dtype_a, dtype_b): return get_elwise_kernel(context, [ VectorArg(dtype_res, "out", with_offset=True), VectorArg(dtype_a, "a", with_offset=True), ScalarArg(dtype_b, "b"), ], f"out[i] = a[i] {operator} b", name="scalar_binop_kernel") @context_dependent_memoize def get_array_binop_kernel(context, operator, dtype_res, dtype_a, dtype_b, a_is_scalar=False, b_is_scalar=False): a = "a[0]" if a_is_scalar else "a[i]" b = "b[0]" if b_is_scalar else "b[i]" return get_elwise_kernel(context, [ VectorArg(dtype_res, "out", with_offset=True), VectorArg(dtype_a, "a", with_offset=True), VectorArg(dtype_b, "b", with_offset=True), ], f"out[i] = {a} {operator} {b}", name="binop_kernel") @context_dependent_memoize def get_array_scalar_comparison_kernel(context, operator, dtype_a): return get_elwise_kernel(context, [ VectorArg(np.int8, "out", with_offset=True), VectorArg(dtype_a, "a", with_offset=True), ScalarArg(dtype_a, "b"), ], f"out[i] = a[i] {operator} b", name="scalar_comparison_kernel") @context_dependent_memoize def get_array_comparison_kernel(context, operator, dtype_a, dtype_b): return get_elwise_kernel(context, [ VectorArg(np.int8, "out", with_offset=True), VectorArg(dtype_a, "a", with_offset=True), VectorArg(dtype_b, "b", with_offset=True), ], f"out[i] = a[i] {operator} b[i]", name="comparison_kernel") @context_dependent_memoize def get_unary_func_kernel(context, func_name, in_dtype, out_dtype=None): if out_dtype is None: out_dtype = in_dtype return get_elwise_kernel(context, [ VectorArg(out_dtype, "z", with_offset=True), VectorArg(in_dtype, "y", with_offset=True), ], f"z[i] = {func_name}(y[i])", name=f"{func_name}_kernel") @context_dependent_memoize def get_binary_func_kernel(context, func_name, x_dtype, y_dtype, out_dtype, preamble="", name=None): if name is None: name = func_name return get_elwise_kernel(context, [ VectorArg(out_dtype, "z", with_offset=True), VectorArg(x_dtype, "x", with_offset=True), VectorArg(y_dtype, "y", with_offset=True), ], f"z[i] = {func_name}(x[i], y[i])", name=f"{name}_kernel", preamble=preamble) @context_dependent_memoize def get_float_binary_func_kernel(context, func_name, x_dtype, y_dtype, out_dtype, preamble="", name=None): if name is None: name = func_name if (np.array(0, x_dtype) * np.array(0, y_dtype)).itemsize > 4: arg_type = "double" preamble = """ #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE """ + preamble else: arg_type = "float" return get_elwise_kernel(context, [ VectorArg(out_dtype, "z", with_offset=True), VectorArg(x_dtype, "x", 
with_offset=True), VectorArg(y_dtype, "y", with_offset=True), ], f"z[i] = {func_name}(({arg_type})x[i], ({arg_type})y[i])", name=f"{name}_kernel", preamble=preamble) @context_dependent_memoize def get_fmod_kernel(context, out_dtype=np.float32, arg_dtype=np.float32, mod_dtype=np.float32): return get_float_binary_func_kernel(context, "fmod", arg_dtype, mod_dtype, out_dtype) @context_dependent_memoize def get_modf_kernel(context, int_dtype=np.float32, frac_dtype=np.float32, x_dtype=np.float32): return get_elwise_kernel(context, [ VectorArg(int_dtype, "intpart", with_offset=True), VectorArg(frac_dtype, "fracpart", with_offset=True), VectorArg(x_dtype, "x", with_offset=True), ], """ fracpart[i] = modf(x[i], &intpart[i]) """, name="modf_kernel") @context_dependent_memoize def get_frexp_kernel(context, sign_dtype=np.float32, exp_dtype=np.float32, x_dtype=np.float32): return get_elwise_kernel(context, [ VectorArg(sign_dtype, "significand", with_offset=True), VectorArg(exp_dtype, "exponent", with_offset=True), VectorArg(x_dtype, "x", with_offset=True), ], """ int expt = 0; significand[i] = frexp(x[i], &expt); exponent[i] = expt; """, name="frexp_kernel") @context_dependent_memoize def get_ldexp_kernel(context, out_dtype=np.float32, sig_dtype=np.float32, expt_dtype=np.float32): return get_binary_func_kernel( context, "_PYOCL_LDEXP", sig_dtype, expt_dtype, out_dtype, preamble="#define _PYOCL_LDEXP(x, y) ldexp(x, (int)(y))", name="ldexp_kernel") @context_dependent_memoize def get_minmaximum_kernel(context, minmax, dtype_z, dtype_x, dtype_y, kind_x: ArgumentKind, kind_y: ArgumentKind): if dtype_z.kind == "f": reduce_func = f"f{minmax}_nanprop" elif dtype_z.kind in "iu": reduce_func = minmax else: raise TypeError("unsupported dtype specified") tp_x = dtype_to_ctype(dtype_x) tp_y = dtype_to_ctype(dtype_y) tp_z = dtype_to_ctype(dtype_z) decl_x, acc_x = get_decl_and_access_for_kind("x", kind_x) decl_y, acc_y = get_decl_and_access_for_kind("y", kind_y) return get_elwise_kernel(context, f"{tp_z} *z, {tp_x} {decl_x}, {tp_y} {decl_y}", f"z[i] = {reduce_func}({acc_x}, {acc_y})", name=f"{minmax}imum", preamble=""" #define fmin_nanprop(a, b) (isnan(a) || isnan(b)) ? a+b : fmin(a, b) #define fmax_nanprop(a, b) (isnan(a) || isnan(b)) ? 
a+b : fmax(a, b) """) @context_dependent_memoize def get_bessel_kernel(context, which_func, out_dtype=np.float64, order_dtype=np.int32, x_dtype=np.float64): if x_dtype.kind != "c": return get_elwise_kernel(context, [ VectorArg(out_dtype, "z", with_offset=True), ScalarArg(order_dtype, "ord_n"), VectorArg(x_dtype, "x", with_offset=True), ], f"z[i] = bessel_{which_func}n(ord_n, x[i])", name=f"bessel_{which_func}n_kernel", preamble=f""" #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE #include """) else: if which_func != "j": raise NotImplementedError("complex arguments for Bessel Y") if x_dtype != np.complex128: raise NotImplementedError("non-complex double dtype") if x_dtype != out_dtype: raise NotImplementedError("different input/output types") return get_elwise_kernel(context, [ VectorArg(out_dtype, "z", with_offset=True), ScalarArg(order_dtype, "ord_n"), VectorArg(x_dtype, "x", with_offset=True), ], """ cdouble_t jv_loc; cdouble_t jvp1_loc; bessel_j_complex(ord_n, x[i], &jv_loc, &jvp1_loc); z[i] = jv_loc; """, name="bessel_j_complex_kernel", preamble=""" #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE #include #include """) @context_dependent_memoize def get_hankel_01_kernel(context, out_dtype, x_dtype): if x_dtype != np.complex128: raise NotImplementedError("non-complex double dtype") if x_dtype != out_dtype: raise NotImplementedError("different input/output types") return get_elwise_kernel(context, [ VectorArg(out_dtype, "h0", with_offset=True), VectorArg(out_dtype, "h1", with_offset=True), VectorArg(x_dtype, "x", with_offset=True), ], """ cdouble_t h0_loc; cdouble_t h1_loc; hankel_01_complex(x[i], &h0_loc, &h1_loc, 1); h0[i] = h0_loc; h1[i] = h1_loc; """, name="hankel_complex_kernel", preamble=""" #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE #include #include """) @context_dependent_memoize def get_diff_kernel(context, dtype): return get_elwise_kernel(context, [ VectorArg(dtype, "result", with_offset=True), VectorArg(dtype, "array", with_offset=True), ], "result[i] = array[i+1] - array[i]", name="diff") @context_dependent_memoize def get_if_positive_kernel( context, crit_dtype, then_else_dtype, is_then_array, is_else_array, is_then_scalar, is_else_scalar): if is_then_array: then_ = "then_[0]" if is_then_scalar else "then_[i]" then_arg = VectorArg(then_else_dtype, "then_", with_offset=True) else: assert is_then_scalar then_ = "then_" then_arg = ScalarArg(then_else_dtype, "then_") if is_else_array: else_ = "else_[0]" if is_else_scalar else "else_[i]" else_arg = VectorArg(then_else_dtype, "else_", with_offset=True) else: assert is_else_scalar else_ = "else_" else_arg = ScalarArg(then_else_dtype, "else_") return get_elwise_kernel(context, [ VectorArg(then_else_dtype, "result", with_offset=True), VectorArg(crit_dtype, "crit", with_offset=True), then_arg, else_arg, ], f"result[i] = crit[i] > 0 ? 
{then_} : {else_}", name="if_positive") @context_dependent_memoize def get_logical_not_kernel(context, in_dtype): return get_elwise_kernel(context, [ VectorArg(np.int8, "z", with_offset=True), VectorArg(in_dtype, "y", with_offset=True), ], "z[i] = (y[i] == 0)", name="logical_not_kernel") # }}} # vim: fdm=marker pyopencl-2025.1/pyopencl/invoker.py0000644000000000000000000003335414332717401014254 0ustar00__copyright__ = """ Copyright (C) 2017 Andreas Kloeckner """ __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ from typing import Any, Tuple from warnings import warn import numpy as np from pytools.persistent_dict import WriteOncePersistentDict from pytools.py_codegen import Indentation, PythonCodeGenerator import pyopencl as cl import pyopencl._cl as _cl from pyopencl.tools import VectorArg, _NumpyTypesKeyBuilder # {{{ arg packing helpers _size_t_char = ({ 8: "Q", 4: "L", 2: "H", 1: "B", })[_cl._sizeof_size_t()] _type_char_map = { "n": _size_t_char.lower(), "N": _size_t_char } del _size_t_char # }}} # {{{ generic arg handling body def generate_generic_arg_handling_body(num_args): gen = PythonCodeGenerator() if num_args == 0: gen("pass") else: gen_indices_and_args = [] for i in range(num_args): gen_indices_and_args.append(i) gen_indices_and_args.append(f"arg{i}") gen(f"self._set_arg_multi(" f"({', '.join(str(i) for i in gen_indices_and_args)},), " ")") return gen # }}} # {{{ specific arg handling body BUF_PACK_TYPECHARS = ["c", "b", "B", "h", "H", "i", "I", "l", "L", "f", "d"] def generate_specific_arg_handling_body(function_name, num_cl_args, arg_types, *, work_around_arg_count_bug, warn_about_arg_count_bug, in_enqueue, include_debug_code): assert work_around_arg_count_bug is not None assert warn_about_arg_count_bug is not None fp_arg_count = 0 cl_arg_idx = 0 gen = PythonCodeGenerator() if not arg_types: gen("pass") gen_indices_and_args = [] buf_indices_and_args = [] buf_pack_indices_and_args = [] def add_buf_arg(arg_idx, typechar, expr_str): if typechar in BUF_PACK_TYPECHARS: buf_pack_indices_and_args.append(arg_idx) buf_pack_indices_and_args.append(repr(typechar.encode())) buf_pack_indices_and_args.append(expr_str) else: buf_indices_and_args.append(arg_idx) buf_indices_and_args.append(f"pack('{typechar}', {expr_str})") wait_for_parts = [] for arg_idx, arg_type in enumerate(arg_types): arg_var = "arg%d" % arg_idx if arg_type is None: gen_indices_and_args.append(cl_arg_idx) gen_indices_and_args.append(arg_var) cl_arg_idx += 1 gen("") continue elif isinstance(arg_type, VectorArg): if include_debug_code: gen(f"if not 
{arg_var}.flags.forc:") with Indentation(gen): gen("raise RuntimeError('only contiguous arrays may '") gen(" 'be used as arguments to this operation')") gen("") if in_enqueue and include_debug_code: gen(f"assert {arg_var}.queue is None or {arg_var}.queue == queue, " "'queues for all arrays must match the queue supplied " "to enqueue'") gen_indices_and_args.append(cl_arg_idx) gen_indices_and_args.append(f"{arg_var}.base_data") cl_arg_idx += 1 if arg_type.with_offset: add_buf_arg(cl_arg_idx, np.dtype(np.int64).char, f"{arg_var}.offset") cl_arg_idx += 1 if in_enqueue: wait_for_parts .append(f"{arg_var}.events") continue arg_dtype = np.dtype(arg_type) if arg_dtype.char == "V": buf_indices_and_args.append(cl_arg_idx) buf_indices_and_args.append(arg_var) cl_arg_idx += 1 elif arg_dtype.kind == "c": if warn_about_arg_count_bug: warn("{knl_name}: arguments include complex numbers, and " "some (but not all) of the target devices mishandle " "struct kernel arguments (hence the workaround is " "disabled".format(knl_name=function_name), stacklevel=2) if arg_dtype == np.complex64: arg_char = "f" elif arg_dtype == np.complex128: arg_char = "d" else: raise TypeError("unexpected complex type: %s" % arg_dtype) if (work_around_arg_count_bug == "pocl" and arg_dtype == np.complex128 and fp_arg_count + 2 <= 8): add_buf_arg(cl_arg_idx, arg_char, f"{arg_var}.real") cl_arg_idx += 1 add_buf_arg(cl_arg_idx, arg_char, f"{arg_var}.imag") cl_arg_idx += 1 elif (work_around_arg_count_bug == "apple" and arg_dtype == np.complex128 and fp_arg_count + 2 <= 8): raise NotImplementedError("No work-around to " "Apple's broken structs-as-kernel arg " "handling has been found. " "Cannot pass complex numbers to kernels.") else: buf_indices_and_args.append(cl_arg_idx) buf_indices_and_args.append( f"pack('{arg_char}{arg_char}', {arg_var}.real, {arg_var}.imag)") cl_arg_idx += 1 fp_arg_count += 2 else: if arg_dtype.kind == "f": fp_arg_count += 1 arg_char = arg_dtype.char arg_char = _type_char_map.get(arg_char, arg_char) add_buf_arg(cl_arg_idx, arg_char, arg_var) cl_arg_idx += 1 gen("") for arg_kind, args_and_indices, entry_length in [ ("", gen_indices_and_args, 2), ("_buf", buf_indices_and_args, 2), ("_buf_pack", buf_pack_indices_and_args, 3), ]: assert len(args_and_indices) % entry_length == 0 if args_and_indices: gen(f"self._set_arg{arg_kind}_multi(" f"({', '.join(str(i) for i in args_and_indices)},), " ")") if cl_arg_idx != num_cl_args: raise TypeError( "length of argument list (%d) and " "CL-generated number of arguments (%d) do not agree" % (cl_arg_idx, num_cl_args)) if in_enqueue: return gen, wait_for_parts else: return gen # }}} def _generate_enqueue_and_set_args_module(function_name, num_passed_args, num_cl_args, arg_types, include_debug_code, work_around_arg_count_bug, warn_about_arg_count_bug): arg_names = ["arg%d" % i for i in range(num_passed_args)] def gen_arg_setting(in_enqueue): if arg_types is None: result = generate_generic_arg_handling_body(num_passed_args) if in_enqueue: return result, [] else: return result else: return generate_specific_arg_handling_body( function_name, num_cl_args, arg_types, warn_about_arg_count_bug=warn_about_arg_count_bug, work_around_arg_count_bug=work_around_arg_count_bug, in_enqueue=in_enqueue, include_debug_code=include_debug_code) gen = PythonCodeGenerator() gen("from struct import pack") gen("from pyopencl import status_code") gen("import numpy as np") gen("import pyopencl._cl as _cl") gen("") # {{{ generate _enqueue from pytools import to_identifier enqueue_name = 
f"enqueue_knl_{to_identifier(function_name)}" gen("def %s(%s):" % (enqueue_name, ", ".join([ "self", "queue", "global_size", "local_size", *arg_names, "global_offset=None", "g_times_l=False", "allow_empty_ndrange=False", "wait_for=None"]))) with Indentation(gen): subgen, wait_for_parts = gen_arg_setting(in_enqueue=True) gen.extend(subgen) if wait_for_parts: wait_for_expr = ( "[*(() if wait_for is None else wait_for), " + ", ".join("*"+wfp for wfp in wait_for_parts) + "]") else: wait_for_expr = "wait_for" # Using positional args here because pybind is slow with keyword args gen(f""" return _cl.enqueue_nd_range_kernel(queue, self, global_size, local_size, global_offset, {wait_for_expr}, g_times_l, allow_empty_ndrange) """) # }}} # {{{ generate set_args gen("") gen("def set_args(%s):" % (", ".join(["self", *arg_names]))) with Indentation(gen): gen.extend(gen_arg_setting(in_enqueue=False)) # }}} return ( gen.get_picklable_module( name=f""), enqueue_name) # {{{ Helper functions related to argument sizes and device limits def _get_max_parameter_size(dev): """Return the device's maximum parameter size adjusted for PoCL.""" from pyopencl.characterize import get_pocl_version dev_limit = dev.max_parameter_size pocl_version = get_pocl_version(dev.platform, fallback_value=(1, 8)) if pocl_version is not None and pocl_version < (3, 0): # Current PoCL versions (as of 04/2022) have an incorrect parameter # size limit of 1024; see e.g. https://github.com/pocl/pocl/pull/1046 if dev_limit == 1024: if dev.type & cl.device_type.CPU: return 1024*1024 if dev.type & cl.device_type.GPU: # All modern Nvidia GPUs (starting from Compute Capability 2) # have this limit return 4352 return dev_limit def _check_arg_size(function_name, num_cl_args, arg_types, devs): """Check whether argument sizes exceed the OpenCL device limit.""" for dev in devs: dev_ptr_size = int(dev.address_bits / 8) dev_limit = _get_max_parameter_size(dev) total_arg_size = 0 is_estimate = False if arg_types: for arg_type in arg_types: if arg_type is None: is_estimate = True total_arg_size += dev_ptr_size elif isinstance(arg_type, VectorArg): total_arg_size += dev_ptr_size else: total_arg_size += np.dtype(arg_type).itemsize else: # Estimate that each argument has the size of a pointer on average is_estimate = True total_arg_size = dev_ptr_size * num_cl_args if total_arg_size > dev_limit: from warnings import warn warn(f"Kernel '{function_name}' has {num_cl_args} arguments with " f"a total size of {total_arg_size} bytes, which is higher than " f"the limit of {dev_limit} bytes on {dev}. This might " "lead to compilation errors, especially on GPU devices.", stacklevel=3) elif is_estimate and total_arg_size >= dev_limit * 0.75: # Since total_arg_size is just an estimate, also warn in case we are # just below the actual limit. from warnings import warn warn(f"Kernel '{function_name}' has {num_cl_args} arguments with " f"a total size of {total_arg_size} bytes, which approaches " f"the limit of {dev_limit} bytes on {dev}. 
This might " "lead to compilation errors, especially on GPU devices.", stacklevel=3) # }}} if not cl._PYOPENCL_NO_CACHE: from pytools.py_codegen import PicklableModule invoker_cache: WriteOncePersistentDict[Any, Tuple[PicklableModule, str]] \ = WriteOncePersistentDict( "pyopencl-invoker-cache-v42-nano", key_builder=_NumpyTypesKeyBuilder(), in_mem_cache_size=0, safe_sync=False) def generate_enqueue_and_set_args(function_name, num_passed_args, num_cl_args, arg_types, work_around_arg_count_bug, warn_about_arg_count_bug, devs): _check_arg_size(function_name, num_cl_args, arg_types, devs) cache_key = (function_name, num_passed_args, num_cl_args, arg_types, __debug__, work_around_arg_count_bug, warn_about_arg_count_bug) from_cache = False if not cl._PYOPENCL_NO_CACHE: try: pmod, enqueue_name = invoker_cache[cache_key] from_cache = True except KeyError: pass if not from_cache: pmod, enqueue_name = _generate_enqueue_and_set_args_module(*cache_key) if not cl._PYOPENCL_NO_CACHE: invoker_cache.store_if_not_present(cache_key, (pmod, enqueue_name)) return ( pmod.mod_globals[enqueue_name], pmod.mod_globals["set_args"]) # }}} # vim: foldmethod=marker pyopencl-2025.1/pyopencl/ipython_ext.py0000644000000000000000000000357314332717401015151 0ustar00from IPython.core.magic import Magics, cell_magic, line_magic, magics_class import pyopencl as cl @magics_class class PyOpenCLMagics(Magics): def _run_kernel(self, kernel, options): try: ctx = self.shell.user_ns["cl_ctx"] except KeyError: ctx = None if not isinstance(ctx, cl.Context): ctx = None if ctx is None: try: ctx = self.shell.user_ns["ctx"] except KeyError: ctx = None if ctx is None or not isinstance(ctx, cl.Context): raise RuntimeError("unable to locate cl context, which must be " "present in namespace as 'cl_ctx' or 'ctx'") prg = cl.Program(ctx, kernel).build(options=options.split()) for knl in prg.all_kernels(): self.shell.user_ns[knl.function_name] = knl @cell_magic def cl_kernel(self, line, cell): kernel = cell opts, _args = self.parse_options(line, "o:") build_options = opts.get("o", "") self._run_kernel(kernel, build_options) def _load_kernel_and_options(self, line): opts, args = self.parse_options(line, "o:f:") build_options = opts.get("o") kernel = self.shell.find_user_code(opts.get("f") or args) return kernel, build_options @line_magic def cl_kernel_from_file(self, line): kernel, build_options = self._load_kernel_and_options(line) self._run_kernel(kernel, build_options) @line_magic def cl_load_edit_kernel(self, line): kernel, build_options = self._load_kernel_and_options(line) header = "%%cl_kernel" if build_options: header = f'{header} -o "{build_options}"' content = f"{header}\n\n{kernel}" self.shell.set_next_input(content) def load_ipython_extension(ip): ip.register_magics(PyOpenCLMagics) pyopencl-2025.1/pyopencl/reduction.py0000644000000000000000000006167514332717401014602 0ustar00"""Computation of reductions on vectors.""" __copyright__ = "Copyright (C) 2010 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the 
Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Based on code/ideas by Mark Harris. None of the original source code remains. """ from dataclasses import dataclass from typing import Any, List, Optional, Tuple, Union import numpy as np import pyopencl as cl from pyopencl.tools import ( DtypedArgument, KernelTemplateBase, _process_code_for_macro, context_dependent_memoize, dtype_to_ctype, ) # {{{ kernel source KERNEL = r"""//CL// #define PCL_GROUP_SIZE ${group_size} #define PCL_READ_AND_MAP(i) (${map_expr}) #define PCL_REDUCE(a, b) (${reduce_expr}) % if double_support: #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define PYOPENCL_DEFINE_CDOUBLE % endif #include <pyopencl-complex.h> ${preamble} typedef ${out_type} pcl_out_type; __kernel void ${name}( __global pcl_out_type *pcl_out__base, long pcl_out__offset, ${arguments} long pcl_start, long pcl_step, long pcl_stop, unsigned int pcl_seq_count, long n) { __global pcl_out_type *pcl_out = (__global pcl_out_type *) ( (__global char *) pcl_out__base + pcl_out__offset); ${arg_prep} __local pcl_out_type pcl_ldata[PCL_GROUP_SIZE]; unsigned int pcl_lid = get_local_id(0); const long pcl_base_idx = get_group_id(0)*PCL_GROUP_SIZE*pcl_seq_count + pcl_lid; long i = pcl_start + pcl_base_idx * pcl_step; pcl_out_type pcl_acc = ${neutral}; for (unsigned pcl_s = 0; pcl_s < pcl_seq_count; ++pcl_s) { if (i >= pcl_stop) break; pcl_acc = PCL_REDUCE(pcl_acc, PCL_READ_AND_MAP(i)); i += PCL_GROUP_SIZE*pcl_step; } pcl_ldata[pcl_lid] = pcl_acc; <% cur_size = group_size %> % while cur_size > 1: barrier(CLK_LOCAL_MEM_FENCE); <% new_size = cur_size // 2 assert new_size * 2 == cur_size %> if (pcl_lid < ${new_size}) { pcl_ldata[pcl_lid] = PCL_REDUCE( pcl_ldata[pcl_lid], pcl_ldata[pcl_lid + ${new_size}]); } <% cur_size = new_size %> % endwhile if (pcl_lid == 0) pcl_out[get_group_id(0)] = pcl_ldata[0]; } """ # }}} # {{{ internal codegen frontends @dataclass(frozen=True) class _ReductionInfo: context: cl.Context source: str group_size: int program: cl.Program kernel: cl.Kernel arg_types: List[DtypedArgument] def _get_reduction_source( ctx: cl.Context, out_type: str, out_type_size: int, neutral: str, reduce_expr: str, map_expr: str, parsed_args: List[DtypedArgument], name: str = "reduce_kernel", preamble: str = "", arg_prep: str = "", device: Optional[cl.Device] = None, max_group_size: Optional[int] = None) -> Tuple[str, int]: if device is not None: devices = [device] else: devices = ctx.devices # {{{ compute group size def get_dev_group_size(device: cl.Device) -> int: # dirty fix for the RV770 boards max_work_group_size = device.max_work_group_size if "RV770" in device.name: max_work_group_size = 64 # compute lmem limit from pytools import div_ceil lmem_wg_size = div_ceil(max_work_group_size, out_type_size) result = min(max_work_group_size, lmem_wg_size) # round down to power of 2 from pyopencl.tools import bitlog2 return 2**bitlog2(result) group_size = min(get_dev_group_size(dev) for dev in devices) if max_group_size is not None: group_size = min(max_group_size, group_size) # }}} from mako.template import Template from pyopencl.characterize 
import has_double_support arguments = ", ".join(arg.declarator() for arg in parsed_args) if parsed_args: arguments += ", " src = str(Template(KERNEL).render( out_type=out_type, group_size=group_size, arguments=arguments, neutral=neutral, reduce_expr=_process_code_for_macro(reduce_expr), map_expr=_process_code_for_macro(map_expr), name=name, preamble=preamble, arg_prep=arg_prep, double_support=all(has_double_support(dev) for dev in devices), )) return src, group_size def get_reduction_kernel( stage: int, ctx: cl.Context, dtype_out: Any, neutral: str, reduce_expr: str, map_expr: Optional[str] = None, arguments: Optional[List[DtypedArgument]] = None, name: str = "reduce_kernel", preamble: str = "", device: Optional[cl.Device] = None, options: Any = None, max_group_size: Optional[int] = None) -> _ReductionInfo: if stage not in (1, 2): raise ValueError(f"unknown stage index: '{stage}'") if map_expr is None: map_expr = "pyopencl_reduction_inp[i]" if stage == 2 else "in[i]" from pyopencl.tools import ( VectorArg, get_arg_list_scalar_arg_dtypes, get_arg_offset_adjuster_code, parse_arg_list, ) if arguments is None: raise ValueError("arguments must not be None") arguments = parse_arg_list(arguments, with_offset=True) arg_prep = get_arg_offset_adjuster_code(arguments) if stage == 2 and arguments is not None: arguments = [ VectorArg(dtype_out, "pyopencl_reduction_inp"), *arguments] source, group_size = _get_reduction_source( ctx, dtype_to_ctype(dtype_out), dtype_out.itemsize, neutral, reduce_expr, map_expr, arguments, name, preamble, arg_prep, device, max_group_size) program = cl.Program(ctx, source) program.build(options) kernel = getattr(program, name) kernel.set_scalar_arg_dtypes( [None, np.int64] + get_arg_list_scalar_arg_dtypes(arguments) + [np.int64]*3 + [np.uint32, np.int64] ) return _ReductionInfo( context=ctx, source=source, group_size=group_size, program=program, kernel=kernel, arg_types=arguments ) # }}} # {{{ main reduction kernel _MAX_GROUP_COUNT = 1024 _SMALL_SEQ_COUNT = 4 class ReductionKernel: """A kernel that performs a generic reduction on arrays. Generate a kernel that takes a number of scalar or vector *arguments* (at least one vector argument), performs the *map_expr* on each entry of the vector argument and then the *reduce_expr* on the outcome of that. *neutral* serves as an initial value. *preamble* offers the possibility to add preprocessor directives and other code (such as helper functions) to be added before the actual reduction kernel code. Vectors in *map_expr* should be indexed by the variable *i*. *reduce_expr* uses the formal values "a" and "b" to indicate two operands of a binary reduction operation. If you do not specify a *map_expr*, ``in[i]`` is automatically assumed and treated as the only one input argument. *dtype_out* specifies the :class:`numpy.dtype` in which the reduction is performed and in which the result is returned. *neutral* is specified as float or integer formatted as string. *reduce_expr* and *map_expr* are specified as string formatted operations and *arguments* is specified as a string formatted as a C argument list. *name* specifies the name as which the kernel is compiled. *options* are passed unmodified to :meth:`pyopencl.Program.build`. *preamble* specifies a string of code that is inserted before the actual kernels. .. automethod:: __init__ .. 
automethod:: __call__ """ def __init__( self, ctx: cl.Context, dtype_out: Any, neutral: str, reduce_expr: str, map_expr: Optional[str] = None, arguments: Optional[Union[str, List[DtypedArgument]]] = None, name: str = "reduce_kernel", options: Any = None, preamble: str = "") -> None: if arguments is None: raise ValueError("arguments must not be None") from pyopencl.tools import parse_arg_list arguments = parse_arg_list(arguments, with_offset=True) dtype_out = self.dtype_out = np.dtype(dtype_out) max_group_size = None trip_count = 0 while True: self.stage_1_inf = get_reduction_kernel(1, ctx, dtype_out, neutral, reduce_expr, map_expr, arguments, name=f"{name}_stage1", options=options, preamble=preamble, max_group_size=max_group_size) kernel_max_wg_size = self.stage_1_inf.kernel.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, ctx.devices[0]) if self.stage_1_inf.group_size <= kernel_max_wg_size: break else: max_group_size = kernel_max_wg_size trip_count += 1 assert trip_count <= 2 self.stage_2_inf = get_reduction_kernel(2, ctx, dtype_out, neutral, reduce_expr, arguments=arguments, name=f"{name}_stage2", options=options, preamble=preamble, max_group_size=max_group_size) def __call__(self, *args: Any, **kwargs: Any) -> cl.Event: """Invoke the generated kernel. |explain-waitfor| With *out* the resulting single-entry :class:`pyopencl.array.Array` can be specified. Because offsets are supported one can store results anywhere (e.g. ``out=a[3]``). .. note:: The returned :class:`pyopencl.Event` corresponds only to part of the execution of the reduction. It is not suitable for profiling. .. versionadded:: 2011.1 .. versionchanged:: 2014.2 Added *out* parameter. .. versionchanged:: 2016.2 *range_* and *slice_* added. :arg range: A :class:`slice` object. Specifies the range of indices on which the kernel will be executed. May not be given at the same time as *slice*. :arg slice: A :class:`slice` object. Specifies the range of indices on which the kernel will be executed, relative to the first vector-like argument. May not be given at the same time as *range*. :arg return_event: a boolean flag used to return an event for the reduction. :return: the resulting scalar as a single-entry :class:`pyopencl.array.Array` if *return_event* is *False*, otherwise a tuple ``(scalar_array, event)``. """ queue = kwargs.pop("queue", None) allocator = kwargs.pop("allocator", None) wait_for = kwargs.pop("wait_for", None) return_event = kwargs.pop("return_event", False) out = kwargs.pop("out", None) range_ = kwargs.pop("range", None) slice_ = kwargs.pop("slice", None) if kwargs: raise TypeError("invalid keyword argument to reduction kernel") if wait_for is None: wait_for = [] else: # We'll be modifying it below. 
wait_for = list(wait_for) from pyopencl.array import empty stage_inf = self.stage_1_inf stage1_args = args while True: invocation_args = [] vectors = [] array_empty = empty from pyopencl.tools import VectorArg for arg, arg_tp in zip(args, stage_inf.arg_types): if isinstance(arg_tp, VectorArg): array_empty = arg.__class__ if not arg.flags.forc: raise RuntimeError( f"{type(self).__name__} cannot deal with " "non-contiguous arrays") vectors.append(arg) invocation_args.append(arg.base_data) if arg_tp.with_offset: invocation_args.append(arg.offset) wait_for.extend(arg.events) else: invocation_args.append(arg) if vectors: repr_vec = vectors[0] else: repr_vec = None # {{{ range/slice processing if range_ is not None: if slice_ is not None: raise TypeError("may not specify both range and slice " "keyword arguments") else: if slice_ is None: slice_ = slice(None) if repr_vec is None: raise TypeError( "must have vector argument when range is not specified") range_ = slice(*slice_.indices(repr_vec.size)) assert range_ is not None start = range_.start if start is None: start = 0 if range_.step is None: step = 1 else: step = range_.step sz = abs(range_.stop - start)//step # }}} if queue is not None: use_queue = queue else: if repr_vec is None: raise TypeError( "must specify queue argument when no vector argument present" ) use_queue = repr_vec.queue if allocator is None: if repr_vec is None: from pyopencl.tools import DeferredAllocator allocator = DeferredAllocator(queue.context) else: allocator = repr_vec.allocator if sz == 0: result = array_empty( use_queue, (), self.dtype_out, allocator=allocator) group_count = 1 seq_count = 0 elif sz <= stage_inf.group_size*_SMALL_SEQ_COUNT*_MAX_GROUP_COUNT: total_group_size = _SMALL_SEQ_COUNT*stage_inf.group_size group_count = (sz + total_group_size - 1) // total_group_size seq_count = _SMALL_SEQ_COUNT else: group_count = _MAX_GROUP_COUNT macrogroup_size = group_count*stage_inf.group_size seq_count = (sz + macrogroup_size - 1) // macrogroup_size size_args = [start, step, range_.stop, seq_count, sz] if group_count == 1 and out is not None: result = out elif group_count == 1: result = array_empty(use_queue, (), self.dtype_out, allocator=allocator) else: result = array_empty(use_queue, (group_count,), self.dtype_out, allocator=allocator) last_evt = stage_inf.kernel( use_queue, (group_count*stage_inf.group_size,), (stage_inf.group_size,), *([result.base_data, result.offset, *invocation_args, *size_args]), wait_for=wait_for) wait_for = [last_evt] result.add_event(last_evt) if group_count == 1: if return_event: return result, last_evt else: return result else: stage_inf = self.stage_2_inf args = (result, *stage1_args) range_ = slice_ = None # }}} # {{{ template class ReductionTemplate(KernelTemplateBase): def __init__( self, arguments: Union[str, List[DtypedArgument]], neutral: str, reduce_expr: str, map_expr: Optional[str] = None, is_segment_start_expr: Optional[str] = None, input_fetch_exprs: Optional[List[Tuple[str, str, int]]] = None, name_prefix: str = "reduce", preamble: str = "", template_processor: Any = None) -> None: super().__init__(template_processor=template_processor) if input_fetch_exprs is None: input_fetch_exprs = [] self.arguments = arguments self.reduce_expr = reduce_expr self.neutral = neutral self.map_expr = map_expr self.name_prefix = name_prefix self.preamble = preamble def build_inner(self, context, type_aliases=(), var_values=(), more_preamble="", more_arguments=(), declare_types=(), options=None, devices=None): renderer = self.get_renderer( 
type_aliases, var_values, context, options) arg_list = renderer.render_argument_list( self.arguments, more_arguments) type_decl_preamble = renderer.get_type_decl_preamble( context.devices[0], declare_types, arg_list) return ReductionKernel(context, renderer.type_aliases["reduction_t"], renderer(self.neutral), renderer(self.reduce_expr), renderer(self.map_expr), renderer.render_argument_list(self.arguments, more_arguments), name=renderer(self.name_prefix), options=options, preamble=( type_decl_preamble + "\n" + renderer(f"{self.preamble}\n{more_preamble}"))) # }}} # {{{ array reduction kernel getters @context_dependent_memoize def get_any_kernel(ctx, dtype_in): from pyopencl.tools import VectorArg return ReductionKernel(ctx, np.int8, "false", "a || b", map_expr="(bool) (in[i])", arguments=[VectorArg(dtype_in, "in")]) @context_dependent_memoize def get_all_kernel(ctx, dtype_in): from pyopencl.tools import VectorArg return ReductionKernel(ctx, np.int8, "true", "a && b", map_expr="(bool) (in[i])", arguments=[VectorArg(dtype_in, "in")]) @context_dependent_memoize def get_sum_kernel(ctx, dtype_out, dtype_in): if dtype_out is None: dtype_out = dtype_in reduce_expr = "a+b" neutral_expr = "0" if dtype_out.kind == "c": from pyopencl.elementwise import complex_dtype_to_name dtname = complex_dtype_to_name(dtype_out) reduce_expr = f"{dtname}_add(a, b)" neutral_expr = f"{dtname}_new(0, 0)" return ReductionKernel( ctx, dtype_out, neutral_expr, reduce_expr, arguments="const {} *in".format(dtype_to_ctype(dtype_in)), ) def _get_dot_expr(dtype_out, dtype_a, dtype_b, conjugate_first, has_double_support, index_expr="i"): if dtype_b is None: if dtype_a is None: dtype_b = dtype_out else: dtype_b = dtype_a if dtype_out is None: from pyopencl.compyte.array import get_common_dtype dtype_out = get_common_dtype( dtype_a.type(0), dtype_b.type(0), has_double_support) a_is_complex = dtype_a.kind == "c" b_is_complex = dtype_b.kind == "c" from pyopencl.elementwise import complex_dtype_to_name a = f"a[{index_expr}]" b = f"b[{index_expr}]" if a_is_complex and (dtype_a != dtype_out): a = "{}_cast({})".format(complex_dtype_to_name(dtype_out), a) if b_is_complex and (dtype_b != dtype_out): b = "{}_cast({})".format(complex_dtype_to_name(dtype_out), b) if a_is_complex and conjugate_first and a_is_complex: a = "{}_conj({})".format( complex_dtype_to_name(dtype_out), a) if a_is_complex and not b_is_complex: map_expr = "{}_mulr({}, {})".format(complex_dtype_to_name(dtype_out), a, b) elif not a_is_complex and b_is_complex: map_expr = "{}_rmul({}, {})".format(complex_dtype_to_name(dtype_out), a, b) elif a_is_complex and b_is_complex: map_expr = "{}_mul({}, {})".format(complex_dtype_to_name(dtype_out), a, b) else: map_expr = f"{a}*{b}" return map_expr, dtype_out, dtype_b @context_dependent_memoize def get_dot_kernel(ctx, dtype_out, dtype_a=None, dtype_b=None, conjugate_first=False): from pyopencl.characterize import has_double_support map_expr, dtype_out, dtype_b = _get_dot_expr( dtype_out, dtype_a, dtype_b, conjugate_first, has_double_support=has_double_support(ctx.devices[0])) reduce_expr = "a+b" neutral_expr = "0" if dtype_out.kind == "c": from pyopencl.elementwise import complex_dtype_to_name dtname = complex_dtype_to_name(dtype_out) reduce_expr = f"{dtname}_add(a, b)" neutral_expr = f"{dtname}_new(0, 0)" return ReductionKernel(ctx, dtype_out, neutral=neutral_expr, reduce_expr=reduce_expr, map_expr=map_expr, arguments=( "const {tp_a} *a, const {tp_b} *b".format( tp_a=dtype_to_ctype(dtype_a), tp_b=dtype_to_ctype(dtype_b), )) ) 
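# Illustrative sketch (added commentary, not part of the original source):
# how a memoized getter such as get_dot_kernel above is typically used.
# The context and queue are assumed to be supplied by the caller; the
# dtypes select the kernel specialization, which is memoized per context.
def _example_get_dot_kernel_usage(ctx, queue):
    import pyopencl.array as cl_array

    a = cl_array.arange(queue, 1000, dtype=np.float32)
    b = cl_array.arange(queue, 1000, dtype=np.float32)

    # The first call compiles the two-stage reduction; subsequent calls
    # with the same dtypes on the same context reuse the compiled kernel.
    knl = get_dot_kernel(ctx, np.dtype(np.float32), a.dtype, b.dtype)

    # The reduction yields a single-entry device array; get() copies the
    # scalar result back to the host.
    return knl(a, b, queue=queue).get()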
@context_dependent_memoize def get_subset_dot_kernel(ctx, dtype_out, dtype_subset, dtype_a=None, dtype_b=None, conjugate_first=False): from pyopencl.characterize import has_double_support map_expr, dtype_out, dtype_b = _get_dot_expr( dtype_out, dtype_a, dtype_b, conjugate_first, has_double_support=has_double_support(ctx.devices[0]), index_expr="lookup_tbl[i]") # important: lookup_tbl must be first--it controls the length return ReductionKernel(ctx, dtype_out, neutral="0", reduce_expr="a+b", map_expr=map_expr, arguments=( "const {tp_lut} *lookup_tbl, const {tp_a} *a, const {tp_b} *b" .format( tp_lut=dtype_to_ctype(dtype_subset), tp_a=dtype_to_ctype(dtype_a), tp_b=dtype_to_ctype(dtype_b), )) ) _MINMAX_PREAMBLE = """ #define MY_INFINITY (1./0) #define fmin_nanprop(a, b) (isnan(a) || isnan(b)) ? a+b : fmin(a, b) #define fmax_nanprop(a, b) (isnan(a) || isnan(b)) ? a+b : fmax(a, b) """ def get_minmax_neutral(what, dtype): dtype = np.dtype(dtype) if issubclass(dtype.type, np.inexact): if what == "min": return "MY_INFINITY" elif what == "max": return "-MY_INFINITY" else: raise ValueError("what is not min or max.") else: if what == "min": return str(np.iinfo(dtype).max) elif what == "max": return str(np.iinfo(dtype).min) else: raise ValueError("what is not min or max.") @context_dependent_memoize def get_minmax_kernel(ctx, what, dtype): if dtype.kind == "f": reduce_expr = f"f{what}_nanprop(a,b)" elif dtype.kind in "iu": reduce_expr = f"{what}(a,b)" else: raise TypeError("unsupported dtype specified") return ReductionKernel(ctx, dtype, neutral=get_minmax_neutral(what, dtype), reduce_expr=f"{reduce_expr}", arguments="const {tp} *in".format( tp=dtype_to_ctype(dtype), ), preamble=_MINMAX_PREAMBLE) @context_dependent_memoize def get_subset_minmax_kernel(ctx, what, dtype, dtype_subset): if dtype.kind == "f": reduce_expr = f"f{what}(a, b)" elif dtype.kind in "iu": reduce_expr = f"{what}(a, b)" else: raise TypeError("unsupported dtype specified") return ReductionKernel(ctx, dtype, neutral=get_minmax_neutral(what, dtype), reduce_expr=f"{reduce_expr}", map_expr="in[lookup_tbl[i]]", arguments=( "const {tp_lut} *lookup_tbl, " "const {tp} *in".format( tp=dtype_to_ctype(dtype), tp_lut=dtype_to_ctype(dtype_subset), )), preamble=_MINMAX_PREAMBLE) # }}} # vim: filetype=pyopencl:fdm=marker pyopencl-2025.1/pyopencl/scan.py0000644000000000000000000020014614332717401013516 0ustar00"""Scan primitive.""" __copyright__ = """ Copyright 2011-2012 Andreas Kloeckner Copyright 2008-2011 NVIDIA Corporation """ __license__ = """ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
Derived from code within the Thrust project, https://github.com/NVIDIA/thrust """ import logging from abc import ABC, abstractmethod from dataclasses import dataclass from typing import Any, Dict, List, Optional, Set, Tuple, Union import numpy as np from pytools.persistent_dict import WriteOncePersistentDict import pyopencl as cl import pyopencl._mymako as mako import pyopencl.array from pyopencl._cluda import CLUDA_PREAMBLE from pyopencl.tools import ( DtypedArgument, KernelTemplateBase, _NumpyTypesKeyBuilder, _process_code_for_macro, bitlog2, context_dependent_memoize, dtype_to_ctype, get_arg_list_scalar_arg_dtypes, get_arg_offset_adjuster_code, ) logger = logging.getLogger(__name__) # {{{ preamble SHARED_PREAMBLE = CLUDA_PREAMBLE + """//CL// #define WG_SIZE ${wg_size} #define SCAN_EXPR(a, b, across_seg_boundary) ${scan_expr} #define INPUT_EXPR(i) (${input_expr}) %if is_segmented: #define IS_SEG_START(i, a) (${is_segment_start_expr}) %endif ${preamble} typedef ${dtype_to_ctype(scan_dtype)} scan_type; typedef ${dtype_to_ctype(index_dtype)} index_type; // NO_SEG_BOUNDARY is the largest representable integer in index_type. // This assumption is used in code below. #define NO_SEG_BOUNDARY ${str(np.iinfo(index_dtype).max)} """ # }}} # {{{ main scan code # Algorithm: Each work group is responsible for one contiguous # 'interval'. There are just enough intervals to fill all compute # units. Intervals are split into 'units'. A unit is what gets # worked on in parallel by one work group. # # in index space: # interval > unit > local-parallel > k-group # # (Note that there is also a transpose in here: The data is read # with local ids along linear index order.) # # Each unit has two axes--the local-id axis and the k axis. # # unit 0: # | | | | | | | | | | ----> lid # | | | | | | | | | | # | | | | | | | | | | # | | | | | | | | | | # | | | | | | | | | | # # | # v k (fastest-moving in linear index) # # unit 1: # | | | | | | | | | | ----> lid # | | | | | | | | | | # | | | | | | | | | | # | | | | | | | | | | # | | | | | | | | | | # # | # v k (fastest-moving in linear index) # # ... # # At a device-global level, this is a three-phase algorithm, in # which first each interval does its local scan, then a scan # across intervals exchanges data globally, and the final update # adds the exchanged sums to each interval. # # Exclusive scan is realized by allowing look-behind (access to the # preceding item) in the final update, by means of a local shift. # # NOTE: All segment_start_in_X indices are relative to the start # of the array. SCAN_INTERVALS_SOURCE = SHARED_PREAMBLE + r"""//CL// #define K ${k_group_size} // #define DEBUG #ifdef DEBUG #define pycl_printf(ARGS) printf ARGS #else #define pycl_printf(ARGS) /* */ #endif KERNEL REQD_WG_SIZE(WG_SIZE, 1, 1) void ${kernel_name}( ${argument_signature}, GLOBAL_MEM scan_type *restrict partial_scan_buffer, const index_type N, const index_type interval_size %if is_first_level: , GLOBAL_MEM scan_type *restrict interval_results %endif %if is_segmented and is_first_level: // NO_SEG_BOUNDARY if no segment boundary in interval. , GLOBAL_MEM index_type *restrict g_first_segment_start_in_interval %endif %if store_segment_start_flags: , GLOBAL_MEM char *restrict g_segment_start_flags %endif ) { ${arg_offset_adjustment} // index K in first dimension used for carry storage %if use_bank_conflict_avoidance: // Avoid bank conflicts by adding a single 32-bit value to the size of // the scan type. 
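// (Illustrative note, added commentary: with 4-byte-wide local memory
// banks, a scan type of 8 or 16 bytes strides across a power-of-two
// number of banks, so concurrent accesses from a column of work items
// pile onto the same banks; padding by one extra int makes the stride
// odd, which spreads the accesses more evenly.)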
struct __attribute__ ((__packed__)) wrapped_scan_type { scan_type value; int dummy; }; %else: struct wrapped_scan_type { scan_type value; }; %endif // padded in WG_SIZE to avoid bank conflicts LOCAL_MEM struct wrapped_scan_type ldata[K + 1][WG_SIZE + 1]; %if is_segmented: LOCAL_MEM char l_segment_start_flags[K][WG_SIZE]; LOCAL_MEM index_type l_first_segment_start_in_subtree[WG_SIZE]; // only relevant/populated for local id 0 index_type first_segment_start_in_interval = NO_SEG_BOUNDARY; index_type first_segment_start_in_k_group, first_segment_start_in_subtree; %endif // {{{ declare local data for input_fetch_exprs if any of them are stenciled <% fetch_expr_offsets = {} for name, arg_name, ife_offset in input_fetch_exprs: fetch_expr_offsets.setdefault(arg_name, set()).add(ife_offset) local_fetch_expr_args = set( arg_name for arg_name, ife_offsets in fetch_expr_offsets.items() if -1 in ife_offsets or len(ife_offsets) > 1) %> %for arg_name in local_fetch_expr_args: LOCAL_MEM ${arg_ctypes[arg_name]} l_${arg_name}[WG_SIZE*K]; %endfor // }}} const index_type interval_begin = interval_size * GID_0; const index_type interval_end = min(interval_begin + interval_size, N); const index_type unit_size = K * WG_SIZE; index_type unit_base = interval_begin; %for is_tail in [False, True]: %if not is_tail: for(; unit_base + unit_size <= interval_end; unit_base += unit_size) %else: if (unit_base < interval_end) %endif { // {{{ carry out input_fetch_exprs // (if there are ones that need to be fetched into local) %if local_fetch_expr_args: for(index_type k = 0; k < K; k++) { const index_type offset = k*WG_SIZE + LID_0; const index_type read_i = unit_base + offset; %for arg_name in local_fetch_expr_args: %if is_tail: if (read_i < interval_end) %endif { l_${arg_name}[offset] = ${arg_name}[read_i]; } %endfor } local_barrier(); %endif pycl_printf(("after input_fetch_exprs\n")); // }}} // {{{ read a unit's worth of data from global for(index_type k = 0; k < K; k++) { const index_type offset = k*WG_SIZE + LID_0; const index_type read_i = unit_base + offset; %if is_tail: if (read_i < interval_end) %endif { %for name, arg_name, ife_offset in input_fetch_exprs: ${arg_ctypes[arg_name]} ${name}; %if arg_name in local_fetch_expr_args: if (offset + ${ife_offset} >= 0) ${name} = l_${arg_name}[offset + ${ife_offset}]; else if (read_i + ${ife_offset} >= 0) ${name} = ${arg_name}[read_i + ${ife_offset}]; /* else if out of bounds, name is left undefined */ %else: // ${arg_name} gets fetched directly from global ${name} = ${arg_name}[read_i]; %endif %endfor scan_type scan_value = INPUT_EXPR(read_i); const index_type o_mod_k = offset % K; const index_type o_div_k = offset / K; ldata[o_mod_k][o_div_k].value = scan_value; %if is_segmented: bool is_seg_start = IS_SEG_START(read_i, scan_value); l_segment_start_flags[o_mod_k][o_div_k] = is_seg_start; %endif %if store_segment_start_flags: g_segment_start_flags[read_i] = is_seg_start; %endif } } pycl_printf(("after read from global\n")); // }}} // {{{ carry in from previous unit, if applicable %if is_segmented: local_barrier(); first_segment_start_in_k_group = NO_SEG_BOUNDARY; if (l_segment_start_flags[0][LID_0]) first_segment_start_in_k_group = unit_base + K*LID_0; %endif if (LID_0 == 0 && unit_base != interval_begin) { scan_type tmp = ldata[K][WG_SIZE - 1].value; scan_type tmp_aux = ldata[0][0].value; ldata[0][0].value = SCAN_EXPR( tmp, tmp_aux, %if is_segmented: (l_segment_start_flags[0][0]) %else: false %endif ); } pycl_printf(("after carry-in\n")); // }}} local_barrier(); // {{{ scan 
along k (sequentially in each work item) scan_type sum = ldata[0][LID_0].value; %if is_tail: const index_type offset_end = interval_end - unit_base; %endif for (index_type k = 1; k < K; k++) { %if is_tail: if ((index_type) (K * LID_0 + k) < offset_end) %endif { scan_type tmp = ldata[k][LID_0].value; %if is_segmented: index_type seq_i = unit_base + K*LID_0 + k; if (l_segment_start_flags[k][LID_0]) { first_segment_start_in_k_group = min( first_segment_start_in_k_group, seq_i); } %endif sum = SCAN_EXPR(sum, tmp, %if is_segmented: (l_segment_start_flags[k][LID_0]) %else: false %endif ); ldata[k][LID_0].value = sum; } } pycl_printf(("after scan along k\n")); // }}} // store carry in out-of-bounds (padding) array entry (index K) in // the K direction ldata[K][LID_0].value = sum; %if is_segmented: l_first_segment_start_in_subtree[LID_0] = first_segment_start_in_k_group; %endif local_barrier(); // {{{ tree-based local parallel scan // This tree-based scan works as follows: // - Each work item adds the previous item to its current state // - barrier // - Each work item adds in the item from two positions to the left // - barrier // - Each work item adds in the item from four positions to the left // ... // At the end, each item has summed all prior items. // across k groups, along local id // (uses out-of-bounds k=K array entry for storage) scan_type val = ldata[K][LID_0].value; <% scan_offset = 1 %> % while scan_offset <= wg_size: // {{{ reads from local allowed, writes to local not allowed if (LID_0 >= ${scan_offset}) { scan_type tmp = ldata[K][LID_0 - ${scan_offset}].value; % if is_tail: if (K*LID_0 < offset_end) % endif { val = SCAN_EXPR(tmp, val, %if is_segmented: (l_first_segment_start_in_subtree[LID_0] != NO_SEG_BOUNDARY) %else: false %endif ); } %if is_segmented: // Prepare for l_first_segment_start_in_subtree, below. // Note that this update must take place *even* if we're // out of bounds. 
first_segment_start_in_subtree = min( l_first_segment_start_in_subtree[LID_0], l_first_segment_start_in_subtree [LID_0 - ${scan_offset}]); %endif } %if is_segmented: else { first_segment_start_in_subtree = l_first_segment_start_in_subtree[LID_0]; } %endif // }}} local_barrier(); // {{{ writes to local allowed, reads from local not allowed ldata[K][LID_0].value = val; %if is_segmented: l_first_segment_start_in_subtree[LID_0] = first_segment_start_in_subtree; %endif // }}} local_barrier(); %if 0: if (LID_0 == 0) { printf("${scan_offset}: "); for (int i = 0; i < WG_SIZE; ++i) { if (l_first_segment_start_in_subtree[i] == NO_SEG_BOUNDARY) printf("- "); else printf("%d ", l_first_segment_start_in_subtree[i]); } printf("\n"); } %endif <% scan_offset *= 2 %> % endwhile pycl_printf(("after tree scan\n")); // }}} // {{{ update local values if (LID_0 > 0) { sum = ldata[K][LID_0 - 1].value; for(index_type k = 0; k < K; k++) { %if is_tail: if (K * LID_0 + k < offset_end) %endif { scan_type tmp = ldata[k][LID_0].value; ldata[k][LID_0].value = SCAN_EXPR(sum, tmp, %if is_segmented: (unit_base + K * LID_0 + k >= first_segment_start_in_k_group) %else: false %endif ); } } } %if is_segmented: if (LID_0 == 0) { // update interval-wide first-seg variable from current unit first_segment_start_in_interval = min( first_segment_start_in_interval, l_first_segment_start_in_subtree[WG_SIZE-1]); } %endif pycl_printf(("after local update\n")); // }}} local_barrier(); // {{{ write data %if is_gpu: { // work hard with index math to achieve contiguous 32-bit stores __global int *dest = (__global int *) (partial_scan_buffer + unit_base); <% assert scan_dtype.itemsize % 4 == 0 ints_per_wg = wg_size ints_to_store = scan_dtype.itemsize*wg_size*k_group_size // 4 %> const index_type scan_types_per_int = ${scan_dtype.itemsize//4}; %for store_base in range(0, ints_to_store, ints_per_wg): <% # Observe that ints_to_store is divisible by the work group # size already, so we won't go out of bounds that way. 
assert store_base + ints_per_wg <= ints_to_store %> %if is_tail: if (${store_base} + LID_0 < scan_types_per_int*(interval_end - unit_base)) %endif { index_type linear_index = ${store_base} + LID_0; index_type linear_scan_data_idx = linear_index / scan_types_per_int; index_type remainder = linear_index - linear_scan_data_idx * scan_types_per_int; __local int *src = (__local int *) &( ldata [linear_scan_data_idx % K] [linear_scan_data_idx / K].value); dest[linear_index] = src[remainder]; } %endfor } %else: for (index_type k = 0; k < K; k++) { const index_type offset = k*WG_SIZE + LID_0; %if is_tail: if (unit_base + offset < interval_end) %endif { pycl_printf(("write: %d\n", unit_base + offset)); partial_scan_buffer[unit_base + offset] = ldata[offset % K][offset / K].value; } } %endif pycl_printf(("after write\n")); // }}} local_barrier(); } % endfor // write interval sum %if is_first_level: if (LID_0 == 0) { interval_results[GID_0] = partial_scan_buffer[interval_end - 1]; %if is_segmented: g_first_segment_start_in_interval[GID_0] = first_segment_start_in_interval; %endif } %endif } """ # }}} # {{{ update UPDATE_SOURCE = SHARED_PREAMBLE + r"""//CL// KERNEL REQD_WG_SIZE(WG_SIZE, 1, 1) void ${name_prefix}_final_update( ${argument_signature}, const index_type N, const index_type interval_size, GLOBAL_MEM scan_type *restrict interval_results, GLOBAL_MEM scan_type *restrict partial_scan_buffer %if is_segmented: , GLOBAL_MEM index_type *restrict g_first_segment_start_in_interval %endif %if is_segmented and use_lookbehind_update: , GLOBAL_MEM char *restrict g_segment_start_flags %endif ) { ${arg_offset_adjustment} %if use_lookbehind_update: LOCAL_MEM scan_type ldata[WG_SIZE]; %endif %if is_segmented and use_lookbehind_update: LOCAL_MEM char l_segment_start_flags[WG_SIZE]; %endif const index_type interval_begin = interval_size * GID_0; const index_type interval_end = min(interval_begin + interval_size, N); // carry from last interval scan_type carry = ${neutral}; if (GID_0 != 0) carry = interval_results[GID_0 - 1]; %if is_segmented: const index_type first_seg_start_in_interval = g_first_segment_start_in_interval[GID_0]; %endif %if not is_segmented and 'last_item' in output_statement: scan_type last_item = interval_results[GDIM_0-1]; %endif %if not use_lookbehind_update: // {{{ no look-behind ('prev_item' not in output_statement -> simpler) index_type update_i = interval_begin+LID_0; %if is_segmented: index_type seg_end = min(first_seg_start_in_interval, interval_end); %endif for(; update_i < interval_end; update_i += WG_SIZE) { scan_type partial_val = partial_scan_buffer[update_i]; scan_type item = SCAN_EXPR(carry, partial_val, %if is_segmented: (update_i >= seg_end) %else: false %endif ); index_type i = update_i; { ${output_statement}; } } // }}} %else: // {{{ allow look-behind ('prev_item' in output_statement -> complicated) // We are not allowed to branch across barriers at a granularity smaller // than the whole workgroup. Therefore, the for loop is group-global, // and there are lots of local ifs. 
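// Illustrative summary (added commentary, not in the original source):
// each trip through the loop below loads one work group's worth of
// scanned partials into ldata; work items with LID_0 != 0 then read
// prev_item from their left neighbor, while LID_0 == 0 uses the
// incoming carry (A) on the first trip and the saved tail of the
// previous trip (B) afterwards.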
index_type group_base = interval_begin; scan_type prev_item = carry; // (A) for(; group_base < interval_end; group_base += WG_SIZE) { index_type update_i = group_base+LID_0; // load a work group's worth of data if (update_i < interval_end) { scan_type tmp = partial_scan_buffer[update_i]; tmp = SCAN_EXPR(carry, tmp, %if is_segmented: (update_i >= first_seg_start_in_interval) %else: false %endif ); ldata[LID_0] = tmp; %if is_segmented: l_segment_start_flags[LID_0] = g_segment_start_flags[update_i]; %endif } local_barrier(); // find prev_item if (LID_0 != 0) prev_item = ldata[LID_0 - 1]; /* else prev_item = carry (see (A)) OR last tail (see (B)); */ if (update_i < interval_end) { %if is_segmented: if (l_segment_start_flags[LID_0]) prev_item = ${neutral}; %endif scan_type item = ldata[LID_0]; index_type i = update_i; { ${output_statement}; } } if (LID_0 == 0) prev_item = ldata[WG_SIZE - 1]; // (B) local_barrier(); } // }}} %endif } """ # }}} # {{{ driver # {{{ helpers def _round_down_to_power_of_2(val: int) -> int: result = 2**bitlog2(val) if result > val: result >>= 1 assert result <= val return result _PREFIX_WORDS = set(""" ldata partial_scan_buffer global scan_offset segment_start_in_k_group carry g_first_segment_start_in_interval IS_SEG_START tmp Z val l_first_segment_start_in_subtree unit_size index_type interval_begin interval_size offset_end K SCAN_EXPR do_update WG_SIZE first_segment_start_in_k_group scan_type segment_start_in_subtree offset interval_results interval_end first_segment_start_in_subtree unit_base first_segment_start_in_interval k INPUT_EXPR prev_group_sum prev pv value partial_val pgs is_seg_start update_i scan_item_at_i seq_i read_i l_ o_mod_k o_div_k l_segment_start_flags scan_value sum first_seg_start_in_interval g_segment_start_flags group_base seg_end my_val DEBUG ARGS ints_to_store ints_per_wg scan_types_per_int linear_index linear_scan_data_idx dest src store_base wrapped_scan_type dummy scan_tmp tmp_aux LID_2 LID_1 LID_0 LDIM_0 LDIM_1 LDIM_2 GDIM_0 GDIM_1 GDIM_2 GID_0 GID_1 GID_2 """.split()) _IGNORED_WORDS = set(""" 4 8 32 typedef for endfor if void while endwhile endfor endif else const printf None return bool n char true false ifdef pycl_printf str range assert np iinfo max itemsize __packed__ struct restrict ptrdiff_t set iteritems len setdefault GLOBAL_MEM LOCAL_MEM_ARG WITHIN_KERNEL LOCAL_MEM KERNEL REQD_WG_SIZE local_barrier CLK_LOCAL_MEM_FENCE OPENCL EXTENSION pragma __attribute__ __global __kernel __local get_local_size get_local_id cl_khr_fp64 reqd_work_group_size get_num_groups barrier get_group_id CL_VERSION_1_1 __OPENCL_C_VERSION__ 120 _final_update _debug_scan kernel_name positions all padded integer its previous write based writes 0 has local worth scan_expr to read cannot not X items False bank four beginning follows applicable item min each indices works side scanning right summed relative used id out index avoid current state boundary True across be This reads groups along Otherwise undetermined store of times prior s update first regardless Each number because array unit from segment conflicts two parallel 2 empty define direction CL padding work tree bounds values and adds scan is allowed thus it an as enable at in occur sequentially end no storage data 1 largest may representable uses entry Y meaningful computations interval At the left dimension know d A load B group perform shift tail see last OR this add fetched into are directly need gets them stenciled that undefined there up any ones or name only relevant populated even wide we Prepare 
int seg Note re below place take variable must intra Therefore find code assumption branch workgroup complicated granularity phase remainder than simpler We smaller look ifs lots self behind allow barriers whole loop after already Observe achieve contiguous stores hard go with by math size won t way divisible bit so Avoid declare adding single type is_tail is_first_level input_expr argument_signature preamble double_support neutral output_statement k_group_size name_prefix is_segmented index_dtype scan_dtype wg_size is_segment_start_expr fetch_expr_offsets arg_ctypes ife_offsets input_fetch_exprs def ife_offset arg_name local_fetch_expr_args update_body update_loop_lookbehind update_loop_plain update_loop use_lookbehind_update store_segment_start_flags update_loop first_seg scan_dtype dtype_to_ctype is_gpu use_bank_conflict_avoidance a b prev_item i last_item prev_value N NO_SEG_BOUNDARY across_seg_boundary arg_offset_adjustment """.split()) def _make_template(s: str): import re leftovers = set() def replace_id(match: "re.Match") -> str: # avoid name clashes with user code by adding 'psc_' prefix to # identifiers. word = match.group(1) if word in _IGNORED_WORDS: return word elif word in _PREFIX_WORDS: return f"psc_{word}" else: leftovers.add(word) return word s = re.sub(r"\b([a-zA-Z0-9_]+)\b", replace_id, s) if leftovers: from warnings import warn warn("Leftover words in identifier prefixing: " + " ".join(leftovers), stacklevel=3) return mako.template.Template(s, strict_undefined=True) # type: ignore @dataclass(frozen=True) class _GeneratedScanKernelInfo: scan_src: str kernel_name: str scalar_arg_dtypes: List[Optional[np.dtype]] wg_size: int k_group_size: int def build(self, context: cl.Context, options: Any) -> "_BuiltScanKernelInfo": program = cl.Program(context, self.scan_src).build(options) kernel = getattr(program, self.kernel_name) kernel.set_scalar_arg_dtypes(self.scalar_arg_dtypes) return _BuiltScanKernelInfo( kernel=kernel, wg_size=self.wg_size, k_group_size=self.k_group_size) @dataclass(frozen=True) class _BuiltScanKernelInfo: kernel: cl.Kernel wg_size: int k_group_size: int @dataclass(frozen=True) class _GeneratedFinalUpdateKernelInfo: source: str kernel_name: str scalar_arg_dtypes: List[Optional[np.dtype]] update_wg_size: int def build(self, context: cl.Context, options: Any) -> "_BuiltFinalUpdateKernelInfo": program = cl.Program(context, self.source).build(options) kernel = getattr(program, self.kernel_name) kernel.set_scalar_arg_dtypes(self.scalar_arg_dtypes) return _BuiltFinalUpdateKernelInfo(kernel, self.update_wg_size) @dataclass(frozen=True) class _BuiltFinalUpdateKernelInfo: kernel: cl.Kernel update_wg_size: int # }}} class ScanPerformanceWarning(UserWarning): pass class GenericScanKernelBase(ABC): # {{{ constructor, argument processing def __init__( self, ctx: cl.Context, dtype: Any, arguments: Union[str, List[DtypedArgument]], input_expr: str, scan_expr: str, neutral: Optional[str], output_statement: str, is_segment_start_expr: Optional[str] = None, input_fetch_exprs: Optional[List[Tuple[str, str, int]]] = None, index_dtype: Any = None, name_prefix: str = "scan", options: Any = None, preamble: str = "", devices: Optional[cl.Device] = None) -> None: """ :arg ctx: a :class:`pyopencl.Context` within which the code for this scan kernel will be generated. :arg dtype: the :class:`numpy.dtype` with which the scan will be performed. May be a structured type if that type was registered through :func:`pyopencl.tools.get_or_register_dtype`. 
:arg arguments: A string of comma-separated C argument declarations. If *arguments* is specified, then *input_expr* must also be specified. All types used here must be known to PyOpenCL. (see :func:`pyopencl.tools.get_or_register_dtype`). :arg scan_expr: The associative, binary operation carrying out the scan, represented as a C string. Its two arguments are available as ``a`` and ``b`` when it is evaluated. ``b`` is guaranteed to be the 'element being updated', and ``a`` is the increment. Thus, if some data is supposed to just propagate along without being modified by the scan, it should live in ``b``. This expression may call functions given in the *preamble*. Another value available to this expression is ``across_seg_boundary``, a C ``bool`` indicating whether this scan update is crossing a segment boundary, as defined by ``is_segment_start_expr``. The scan routine does not implement segmentation semantics on its own. It relies on ``scan_expr`` to do this. This value is available (but always ``false``) even for a non-segmented scan. .. note:: In early pre-releases of the segmented scan, segmentation semantics were implemented *without* relying on ``scan_expr``. :arg input_expr: A C expression, encoded as a string, resulting in the values to which the scan is applied. This may be used to apply a mapping to values stored in *arguments* before being scanned. The result of this expression must match *dtype*. The index intended to be mapped is available as ``i`` in this expression. This expression may also use the variables defined by *input_fetch_expr*. This expression may also call functions given in the *preamble*. :arg output_statement: a C statement that writes the output of the scan. It has access to the scan result as ``item``, the preceding scan result item as ``prev_item``, and the current index as ``i``. ``prev_item`` in a segmented scan will be the neutral element at a segment boundary, not the immediately preceding item. Using *prev_item* in the output statement has a small run-time cost. ``prev_item`` enables the construction of an exclusive scan. For non-segmented scans, *output_statement* may also reference ``last_item``, which evaluates to the scan result of the last array entry. :arg is_segment_start_expr: A C expression, encoded as a string, resulting in a C ``bool`` value that determines whether a new scan segment starts at index *i*. If given, makes the scan a segmented scan. Has access to the current index ``i``, the result of *input_expr* as ``a``, and in addition may use *arguments* and *input_fetch_expr* variables just like *input_expr*. If it returns true, then previous sums will not spill over into the item with index *i* or subsequent items. :arg input_fetch_exprs: a list of tuples *(NAME, ARG_NAME, OFFSET)*. An entry here has the effect of doing the equivalent of the following before input_expr:: ARG_NAME_TYPE NAME = ARG_NAME[i+OFFSET]; ``OFFSET`` is allowed to be 0 or -1, and ``ARG_NAME_TYPE`` is the type of ``ARG_NAME``. :arg preamble: |preamble| The first array in the argument list determines the size of the index space over which the scan is carried out, and thus the values over which the index *i* occurring in a number of code fragments in arguments above will vary. All code fragments further have access to N, the number of elements being processed in the scan. 
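As an illustrative sketch (added here for exposition, not part of the
original docstring), a segmented in-place prefix sum over ``values``
with segment starts marked in a ``flags`` byte array might be set up
as::

    knl = GenericScanKernel(
        ctx, np.int32,
        arguments="__global int *values, __global char *flags",
        input_expr="values[i]",
        scan_expr="across_seg_boundary ? b : (a+b)",
        neutral="0",
        is_segment_start_expr="flags[i]",
        output_statement="values[i] = item;")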
""" if index_dtype is None: index_dtype = np.dtype(np.int32) if input_fetch_exprs is None: input_fetch_exprs = [] self.context = ctx dtype = self.dtype = np.dtype(dtype) if neutral is None: from warnings import warn warn("not specifying 'neutral' is deprecated and will lead to " "wrong results if your scan is not in-place or your " "'output_statement' does something otherwise non-trivial", stacklevel=2) if dtype.itemsize % 4 != 0: raise TypeError("scan value type must have size divisible by 4 bytes") self.index_dtype = np.dtype(index_dtype) if np.iinfo(self.index_dtype).min >= 0: raise TypeError("index_dtype must be signed") if devices is None: devices = ctx.devices self.devices = devices self.options = options from pyopencl.tools import parse_arg_list self.parsed_args = parse_arg_list(arguments) from pyopencl.tools import VectorArg self.first_array_idx = next( i for i, arg in enumerate(self.parsed_args) if isinstance(arg, VectorArg)) self.input_expr = input_expr self.is_segment_start_expr = is_segment_start_expr self.is_segmented = is_segment_start_expr is not None if self.is_segmented: is_segment_start_expr = _process_code_for_macro(is_segment_start_expr) self.output_statement = output_statement for _name, _arg_name, ife_offset in input_fetch_exprs: if ife_offset not in [0, -1]: raise RuntimeError("input_fetch_expr offsets must either be 0 or -1") self.input_fetch_exprs = input_fetch_exprs arg_dtypes = {} arg_ctypes = {} for arg in self.parsed_args: arg_dtypes[arg.name] = arg.dtype arg_ctypes[arg.name] = dtype_to_ctype(arg.dtype) self.name_prefix = name_prefix # {{{ set up shared code dict from pyopencl.characterize import has_double_support self.code_variables = { "np": np, "dtype_to_ctype": dtype_to_ctype, "preamble": preamble, "name_prefix": name_prefix, "index_dtype": self.index_dtype, "scan_dtype": dtype, "is_segmented": self.is_segmented, "arg_dtypes": arg_dtypes, "arg_ctypes": arg_ctypes, "scan_expr": _process_code_for_macro(scan_expr), "neutral": _process_code_for_macro(neutral), "is_gpu": bool(self.devices[0].type & cl.device_type.GPU), "double_support": all( has_double_support(dev) for dev in devices), } index_typename = dtype_to_ctype(self.index_dtype) scan_typename = dtype_to_ctype(dtype) # This key is meant to uniquely identify the non-device parameters for # the scan kernel. self.kernel_key = ( self.dtype, tuple(arg.declarator() for arg in self.parsed_args), self.input_expr, scan_expr, neutral, output_statement, is_segment_start_expr, tuple(input_fetch_exprs), index_dtype, name_prefix, preamble, # These depend on dtype_to_ctype(), so their value is independent of # the other variables. index_typename, scan_typename, ) # }}} self.use_lookbehind_update = "prev_item" in self.output_statement self.store_segment_start_flags = ( self.is_segmented and self.use_lookbehind_update) self.finish_setup() # }}} @abstractmethod def finish_setup(self) -> None: pass if not cl._PYOPENCL_NO_CACHE: generic_scan_kernel_cache: WriteOncePersistentDict[Any, Tuple[_GeneratedScanKernelInfo, _GeneratedScanKernelInfo, _GeneratedFinalUpdateKernelInfo]] = \ WriteOncePersistentDict( "pyopencl-generated-scan-kernel-cache-v1", key_builder=_NumpyTypesKeyBuilder(), in_mem_cache_size=0, safe_sync=False) class GenericScanKernel(GenericScanKernelBase): """Generates and executes code that performs prefix sums ("scans") on arbitrary types, with many possible tweaks. 
Usage example:: from pyopencl.scan import GenericScanKernel knl = GenericScanKernel( context, np.int32, arguments="__global int *ary", input_expr="ary[i]", scan_expr="a+b", neutral="0", output_statement="ary[i+1] = item;") a = cl.array.arange(queue, 10000, dtype=np.int32) knl(a, queue=queue) .. automethod:: __init__ .. automethod:: __call__ """ def finish_setup(self) -> None: # Before generating the kernel, see if it's cached. from pyopencl.cache import get_device_cache_id devices_key = tuple(get_device_cache_id(device) for device in self.devices) cache_key = (self.kernel_key, devices_key) from_cache = False if not cl._PYOPENCL_NO_CACHE: try: result = generic_scan_kernel_cache[cache_key] from_cache = True logger.debug( "cache hit for generated scan kernel '%s'", self.name_prefix) ( self.first_level_scan_gen_info, self.second_level_scan_gen_info, self.final_update_gen_info) = result except KeyError: pass if not from_cache: logger.debug( "cache miss for generated scan kernel '%s'", self.name_prefix) self._finish_setup_impl() result = (self.first_level_scan_gen_info, self.second_level_scan_gen_info, self.final_update_gen_info) if not cl._PYOPENCL_NO_CACHE: generic_scan_kernel_cache.store_if_not_present(cache_key, result) # Build the kernels. self.first_level_scan_info = self.first_level_scan_gen_info.build( self.context, self.options) del self.first_level_scan_gen_info self.second_level_scan_info = self.second_level_scan_gen_info.build( self.context, self.options) del self.second_level_scan_gen_info self.final_update_info = self.final_update_gen_info.build( self.context, self.options) del self.final_update_gen_info def _finish_setup_impl(self) -> None: # {{{ find usable workgroup/k-group size, build first-level scan trip_count = 0 avail_local_mem = min( dev.local_mem_size for dev in self.devices) if "CUDA" in self.devices[0].platform.name: # not sure where these go, but roughly this much seems unavailable. avail_local_mem -= 0x400 is_cpu = self.devices[0].type & cl.device_type.CPU is_gpu = self.devices[0].type & cl.device_type.GPU if is_cpu: # (about the widest vector a CPU can support, also taking # into account that CPUs don't hide latency by large work groups max_scan_wg_size = 16 wg_size_multiples = 4 else: max_scan_wg_size = min(dev.max_work_group_size for dev in self.devices) wg_size_multiples = 64 # Intel beignet fails "Out of shared local memory" in test_scan int64 # and asserts in test_sort with this enabled: # https://github.com/inducer/pyopencl/pull/238 # A beignet bug report (outside of pyopencl) suggests packed structs # (which this is) can even give wrong results: # https://bugs.freedesktop.org/show_bug.cgi?id=98717 # TODO: does this also affect Intel Compute Runtime? use_bank_conflict_avoidance = ( self.dtype.itemsize > 4 and self.dtype.itemsize % 8 == 0 and is_gpu and "beignet" not in self.devices[0].platform.version.lower()) # k_group_size should be a power of two because of in-kernel # division by that number. 
solutions = [] for k_exp in range(0, 9): for wg_size in range(wg_size_multiples, max_scan_wg_size+1, wg_size_multiples): k_group_size = 2**k_exp lmem_use = self.get_local_mem_use(wg_size, k_group_size, use_bank_conflict_avoidance) if lmem_use <= avail_local_mem: solutions.append((wg_size*k_group_size, k_group_size, wg_size)) if is_gpu: for wg_size_floor in [256, 192, 128]: have_sol_above_floor = any(wg_size >= wg_size_floor for _, _, wg_size in solutions) if have_sol_above_floor: # delete all solutions not meeting the wg size floor solutions = [(total, try_k_group_size, try_wg_size) for total, try_k_group_size, try_wg_size in solutions if try_wg_size >= wg_size_floor] break _, k_group_size, max_scan_wg_size = max(solutions) while True: candidate_scan_gen_info = self.generate_scan_kernel( max_scan_wg_size, self.parsed_args, _process_code_for_macro(self.input_expr), self.is_segment_start_expr, input_fetch_exprs=self.input_fetch_exprs, is_first_level=True, store_segment_start_flags=self.store_segment_start_flags, k_group_size=k_group_size, use_bank_conflict_avoidance=use_bank_conflict_avoidance) candidate_scan_info = candidate_scan_gen_info.build( self.context, self.options) # Will this device actually let us execute this kernel # at the desired work group size? Building it is the # only way to find out. kernel_max_wg_size = min( candidate_scan_info.kernel.get_work_group_info( cl.kernel_work_group_info.WORK_GROUP_SIZE, dev) for dev in self.devices) if candidate_scan_info.wg_size <= kernel_max_wg_size: break else: max_scan_wg_size = min(kernel_max_wg_size, max_scan_wg_size) trip_count += 1 assert trip_count <= 20 self.first_level_scan_gen_info = candidate_scan_gen_info assert (_round_down_to_power_of_2(candidate_scan_info.wg_size) == candidate_scan_info.wg_size) # }}} # {{{ build second-level scan from pyopencl.tools import VectorArg second_level_arguments = [ *self.parsed_args, VectorArg(self.dtype, "interval_sums"), ] second_level_build_kwargs: Dict[str, Optional[str]] = {} if self.is_segmented: second_level_arguments.append( VectorArg(self.index_dtype, "g_first_segment_start_in_interval_input")) # is_segment_start_expr answers the question "should previous sums # spill over into this item". And since # g_first_segment_start_in_interval_input answers the question if a # segment boundary was found in an interval of data, then if not, # it's ok to spill over. 
second_level_build_kwargs["is_segment_start_expr"] = \ "g_first_segment_start_in_interval_input[i] != NO_SEG_BOUNDARY" else: second_level_build_kwargs["is_segment_start_expr"] = None self.second_level_scan_gen_info = self.generate_scan_kernel( max_scan_wg_size, arguments=second_level_arguments, input_expr="interval_sums[i]", input_fetch_exprs=[], is_first_level=False, store_segment_start_flags=False, k_group_size=k_group_size, use_bank_conflict_avoidance=use_bank_conflict_avoidance, **second_level_build_kwargs) # }}} # {{{ generate final update kernel update_wg_size = min(max_scan_wg_size, 256) final_update_tpl = _make_template(UPDATE_SOURCE) final_update_src = str(final_update_tpl.render( wg_size=update_wg_size, output_statement=self.output_statement, arg_offset_adjustment=get_arg_offset_adjuster_code(self.parsed_args), argument_signature=", ".join( arg.declarator() for arg in self.parsed_args), is_segment_start_expr=self.is_segment_start_expr, input_expr=_process_code_for_macro(self.input_expr), use_lookbehind_update=self.use_lookbehind_update, **self.code_variables)) update_scalar_arg_dtypes = [ *get_arg_list_scalar_arg_dtypes(self.parsed_args), self.index_dtype, self.index_dtype, None, None] if self.is_segmented: # g_first_segment_start_in_interval update_scalar_arg_dtypes.append(None) if self.store_segment_start_flags: update_scalar_arg_dtypes.append(None) # g_segment_start_flags self.final_update_gen_info = _GeneratedFinalUpdateKernelInfo( final_update_src, self.name_prefix + "_final_update", update_scalar_arg_dtypes, update_wg_size) # }}} # {{{ scan kernel build/properties def get_local_mem_use( self, k_group_size: int, wg_size: int, use_bank_conflict_avoidance: bool) -> int: arg_dtypes = {} for arg in self.parsed_args: arg_dtypes[arg.name] = arg.dtype fetch_expr_offsets: Dict[str, Set] = {} for _name, arg_name, ife_offset in self.input_fetch_exprs: fetch_expr_offsets.setdefault(arg_name, set()).add(ife_offset) itemsize = self.dtype.itemsize if use_bank_conflict_avoidance: itemsize += 4 return ( # ldata itemsize*(k_group_size+1)*(wg_size+1) # l_segment_start_flags + k_group_size*wg_size # l_first_segment_start_in_subtree + self.index_dtype.itemsize*wg_size + k_group_size*wg_size*sum( arg_dtypes[arg_name].itemsize for arg_name, ife_offsets in list(fetch_expr_offsets.items()) if -1 in ife_offsets or len(ife_offsets) > 1)) def generate_scan_kernel( self, max_wg_size: int, arguments: List[DtypedArgument], input_expr: str, is_segment_start_expr: Optional[str], input_fetch_exprs: List[Tuple[str, str, int]], is_first_level: bool, store_segment_start_flags: bool, k_group_size: int, use_bank_conflict_avoidance: bool) -> _GeneratedScanKernelInfo: scalar_arg_dtypes = get_arg_list_scalar_arg_dtypes(arguments) # Empirically found on Nv hardware: no need to be bigger than this size wg_size = _round_down_to_power_of_2( min(max_wg_size, 256)) kernel_name = self.code_variables["name_prefix"] if is_first_level: kernel_name += "_lev1" else: kernel_name += "_lev2" scan_tpl = _make_template(SCAN_INTERVALS_SOURCE) scan_src = str(scan_tpl.render( wg_size=wg_size, input_expr=input_expr, k_group_size=k_group_size, arg_offset_adjustment=get_arg_offset_adjuster_code(arguments), argument_signature=", ".join(arg.declarator() for arg in arguments), is_segment_start_expr=is_segment_start_expr, input_fetch_exprs=input_fetch_exprs, is_first_level=is_first_level, store_segment_start_flags=store_segment_start_flags, use_bank_conflict_avoidance=use_bank_conflict_avoidance, kernel_name=kernel_name, 
**self.code_variables)) scalar_arg_dtypes.extend( (None, self.index_dtype, self.index_dtype)) if is_first_level: scalar_arg_dtypes.append(None) # interval_results if self.is_segmented and is_first_level: scalar_arg_dtypes.append(None) # g_first_segment_start_in_interval if store_segment_start_flags: scalar_arg_dtypes.append(None) # g_segment_start_flags return _GeneratedScanKernelInfo( scan_src=scan_src, kernel_name=kernel_name, scalar_arg_dtypes=scalar_arg_dtypes, wg_size=wg_size, k_group_size=k_group_size) # }}} def __call__(self, *args: Any, **kwargs: Any) -> cl.Event: """ |std-enqueue-blurb| .. note:: The returned :class:`pyopencl.Event` corresponds only to part of the execution of the scan. It is not suitable for profiling. :arg queue: queue on which to execute the scan. If not given, the queue of the first :class:`pyopencl.array.Array` in *args* is used :arg allocator: an allocator for the temporary arrays and results. If not given, the allocator of the first :class:`pyopencl.array.Array` in *args* is used. :arg size: specify the length of the scan to be carried out. If not given, this length is inferred from the first argument :arg wait_for: a :class:`list` of events to wait for. """ # {{{ argument processing allocator = kwargs.get("allocator") queue = kwargs.get("queue") n = kwargs.get("size") wait_for = kwargs.get("wait_for") if wait_for is None: wait_for = [] else: wait_for = list(wait_for) if len(args) != len(self.parsed_args): raise TypeError( f"expected {len(self.parsed_args)} arguments, got {len(args)}") first_array = args[self.first_array_idx] allocator = allocator or first_array.allocator queue = queue or first_array.queue if n is None: n, = first_array.shape if n == 0: # We're done here. (But pretend to return an event.) return cl.enqueue_marker(queue, wait_for=wait_for) data_args = [] for arg_descr, arg_val in zip(self.parsed_args, args): from pyopencl.tools import VectorArg if isinstance(arg_descr, VectorArg): data_args.append(arg_val.base_data) if arg_descr.with_offset: data_args.append(arg_val.offset) wait_for.extend(arg_val.events) else: data_args.append(arg_val) # }}} l1_info = self.first_level_scan_info l2_info = self.second_level_scan_info # see CL source above for terminology unit_size = l1_info.wg_size * l1_info.k_group_size max_intervals = 3*max(dev.max_compute_units for dev in self.devices) from pytools import uniform_interval_splitting interval_size, num_intervals = uniform_interval_splitting( n, unit_size, max_intervals) # {{{ allocate some buffers interval_results = cl.array.empty(queue, num_intervals, dtype=self.dtype, allocator=allocator) partial_scan_buffer = cl.array.empty( queue, n, dtype=self.dtype, allocator=allocator) if self.store_segment_start_flags: segment_start_flags = cl.array.empty( queue, n, dtype=np.bool_, allocator=allocator) # }}} # {{{ first level scan of interval (one interval per block) scan1_args = [ *data_args, partial_scan_buffer.data, n, interval_size, interval_results.data, ] if self.is_segmented: first_segment_start_in_interval = cl.array.empty(queue, num_intervals, dtype=self.index_dtype, allocator=allocator) scan1_args.append(first_segment_start_in_interval.data) if self.store_segment_start_flags: scan1_args.append(segment_start_flags.data) l1_evt = l1_info.kernel( queue, (num_intervals,), (l1_info.wg_size,), *scan1_args, g_times_l=True, wait_for=wait_for) # }}} # {{{ second level scan of per-interval results # can scan at most one interval assert interval_size >= num_intervals scan2_args = [ *data_args, interval_results.data, # 
interval_sums ] if self.is_segmented: scan2_args.append(first_segment_start_in_interval.data) scan2_args = [ *scan2_args, interval_results.data, # partial_scan_buffer num_intervals, interval_size] l2_evt = l2_info.kernel( queue, (1,), (l1_info.wg_size,), *scan2_args, g_times_l=True, wait_for=[l1_evt]) # }}} # {{{ update intervals with result of interval scan upd_args = [ *data_args, n, interval_size, interval_results.data, partial_scan_buffer.data] if self.is_segmented: upd_args.append(first_segment_start_in_interval.data) if self.store_segment_start_flags: upd_args.append(segment_start_flags.data) return self.final_update_info.kernel( queue, (num_intervals,), (self.final_update_info.update_wg_size,), *upd_args, g_times_l=True, wait_for=[l2_evt]) # }}} # }}} # {{{ debug kernel DEBUG_SCAN_TEMPLATE = SHARED_PREAMBLE + r"""//CL// KERNEL REQD_WG_SIZE(1, 1, 1) void ${name_prefix}_debug_scan( __global scan_type *scan_tmp, ${argument_signature}, const index_type N) { scan_type current = ${neutral}; scan_type prev; ${arg_offset_adjustment} for (index_type i = 0; i < N; ++i) { %for name, arg_name, ife_offset in input_fetch_exprs: ${arg_ctypes[arg_name]} ${name}; %if ife_offset < 0: if (i+${ife_offset} >= 0) ${name} = ${arg_name}[i+${ife_offset}]; %else: ${name} = ${arg_name}[i]; %endif %endfor scan_type my_val = INPUT_EXPR(i); prev = current; %if is_segmented: bool is_seg_start = IS_SEG_START(i, my_val); %endif current = SCAN_EXPR(prev, my_val, %if is_segmented: is_seg_start %else: false %endif ); scan_tmp[i] = current; } scan_type last_item = scan_tmp[N-1]; for (index_type i = 0; i < N; ++i) { scan_type item = scan_tmp[i]; scan_type prev_item; if (i) prev_item = scan_tmp[i-1]; else prev_item = ${neutral}; { ${output_statement}; } } } """ class GenericDebugScanKernel(GenericScanKernelBase): """ Performs the same function and has the same interface as :class:`GenericScanKernel`, but uses a dead-simple, sequential scan. Works best on CPU platforms, and helps isolate bugs in scans by removing the potential for issues originating in parallel execution. .. automethod:: __call__ """ def finish_setup(self) -> None: scan_tpl = _make_template(DEBUG_SCAN_TEMPLATE) scan_src = str(scan_tpl.render( output_statement=self.output_statement, arg_offset_adjustment=get_arg_offset_adjuster_code(self.parsed_args), argument_signature=", ".join( arg.declarator() for arg in self.parsed_args), is_segment_start_expr=self.is_segment_start_expr, input_expr=_process_code_for_macro(self.input_expr), input_fetch_exprs=self.input_fetch_exprs, wg_size=1, **self.code_variables)) scan_prg = cl.Program(self.context, scan_src).build(self.options) self.kernel = getattr(scan_prg, f"{self.name_prefix}_debug_scan") scalar_arg_dtypes = [ None, *get_arg_list_scalar_arg_dtypes(self.parsed_args), self.index_dtype, ] self.kernel.set_scalar_arg_dtypes(scalar_arg_dtypes) def __call__(self, *args: Any, **kwargs: Any) -> cl.Event: """See :meth:`GenericScanKernel.__call__`.""" # {{{ argument processing allocator = kwargs.get("allocator") queue = kwargs.get("queue") n = kwargs.get("size") wait_for = kwargs.get("wait_for") if wait_for is None: wait_for = [] else: # We'll be modifying it below. 
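# Usage sketch (illustrative; names such as "ctx" and "a_dev" are
# assumptions, not from this file) -- a sequential inclusive prefix
# sum, handy for isolating bugs in a parallel scan:
#
#     knl = GenericDebugScanKernel(
#         ctx, np.int32,
#         arguments="__global int *ary",
#         input_expr="ary[i]",
#         scan_expr="a+b", neutral="0",
#         output_statement="ary[i] = item;")
#     knl(a_dev)  # runs in a single work-item, strictly in order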
wait_for = list(wait_for) if len(args) != len(self.parsed_args): raise TypeError( f"expected {len(self.parsed_args)} arguments, got {len(args)}") first_array = args[self.first_array_idx] allocator = allocator or first_array.allocator queue = queue or first_array.queue if n is None: n, = first_array.shape scan_tmp = cl.array.empty(queue, n, dtype=self.dtype, allocator=allocator) data_args = [scan_tmp.data] from pyopencl.tools import VectorArg for arg_descr, arg_val in zip(self.parsed_args, args): if isinstance(arg_descr, VectorArg): data_args.append(arg_val.base_data) if arg_descr.with_offset: data_args.append(arg_val.offset) wait_for.extend(arg_val.events) else: data_args.append(arg_val) # }}} return self.kernel(queue, (1,), (1,), *([*data_args, n]), wait_for=wait_for) # }}} # {{{ compatibility interface class _LegacyScanKernelBase(GenericScanKernel): def __init__(self, ctx, dtype, scan_expr, neutral=None, name_prefix="scan", options=None, preamble="", devices=None): scan_ctype = dtype_to_ctype(dtype) GenericScanKernel.__init__(self, ctx, dtype, arguments="__global {} *input_ary, __global {} *output_ary".format( scan_ctype, scan_ctype), input_expr="input_ary[i]", scan_expr=scan_expr, neutral=neutral, output_statement=self.ary_output_statement, options=options, preamble=preamble, devices=devices) @property def ary_output_statement(self): raise NotImplementedError def __call__(self, input_ary, output_ary=None, allocator=None, queue=None): allocator = allocator or input_ary.allocator queue = queue or input_ary.queue or output_ary.queue if output_ary is None: output_ary = input_ary if isinstance(output_ary, (str, str)) and output_ary == "new": output_ary = cl.array.empty_like(input_ary, allocator=allocator) if input_ary.shape != output_ary.shape: raise ValueError("input and output must have the same shape") if not input_ary.flags.forc: raise RuntimeError("ScanKernel cannot " "deal with non-contiguous arrays") n, = input_ary.shape if not n: return output_ary GenericScanKernel.__call__(self, input_ary, output_ary, allocator=allocator, queue=queue) return output_ary class InclusiveScanKernel(_LegacyScanKernelBase): ary_output_statement = "output_ary[i] = item;" class ExclusiveScanKernel(_LegacyScanKernelBase): ary_output_statement = "output_ary[i] = prev_item;" # }}} # {{{ template class ScanTemplate(KernelTemplateBase): def __init__( self, arguments: Union[str, List[DtypedArgument]], input_expr: str, scan_expr: str, neutral: Optional[str], output_statement: str, is_segment_start_expr: Optional[str] = None, input_fetch_exprs: Optional[List[Tuple[str, str, int]]] = None, name_prefix: str = "scan", preamble: str = "", template_processor: Any = None) -> None: super().__init__(template_processor=template_processor) if input_fetch_exprs is None: input_fetch_exprs = [] self.arguments = arguments self.input_expr = input_expr self.scan_expr = scan_expr self.neutral = neutral self.output_statement = output_statement self.is_segment_start_expr = is_segment_start_expr self.input_fetch_exprs = input_fetch_exprs self.name_prefix = name_prefix self.preamble = preamble def build_inner(self, context, type_aliases=(), var_values=(), more_preamble="", more_arguments=(), declare_types=(), options=None, devices=None, scan_cls=GenericScanKernel): renderer = self.get_renderer(type_aliases, var_values, context, options) arg_list = renderer.render_argument_list(self.arguments, more_arguments) type_decl_preamble = renderer.get_type_decl_preamble( context.devices[0], declare_types, arg_list) return scan_cls(context, 
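# NB (added note): the argument list is rendered a second time in the
# scan_cls(...) call below; the render above only feeds the
# type-declaration preamble.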
renderer.type_aliases["scan_t"], renderer.render_argument_list(self.arguments, more_arguments), renderer(self.input_expr), renderer(self.scan_expr), renderer(self.neutral), renderer(self.output_statement), is_segment_start_expr=renderer(self.is_segment_start_expr), input_fetch_exprs=self.input_fetch_exprs, index_dtype=renderer.type_aliases.get("index_t", np.int32), name_prefix=renderer(self.name_prefix), options=options, preamble=( type_decl_preamble + "\n" + renderer(self.preamble + "\n" + more_preamble)), devices=devices) # }}} # {{{ 'canned' scan kernels @context_dependent_memoize def get_cumsum_kernel(context, input_dtype, output_dtype): from pyopencl.tools import VectorArg return GenericScanKernel( context, output_dtype, arguments=[ VectorArg(input_dtype, "input"), VectorArg(output_dtype, "output"), ], input_expr="input[i]", scan_expr="a+b", neutral="0", output_statement=""" output[i] = item; """) # }}} # vim: filetype=pyopencl:fdm=marker pyopencl-2025.1/pyopencl/tools.py0000644000000000000000000013156414332717401013741 0ustar00r""" .. _memory-pools: Memory Pools ------------ Memory allocation (e.g. in the form of the :func:`pyopencl.Buffer` constructor) can be expensive if used frequently. For example, code based on :class:`pyopencl.array.Array` can easily run into this issue because a fresh memory area is allocated for each intermediate result. Memory pools are a remedy for this problem based on the observation that often many of the block allocations are of the same sizes as previously used ones. Then, instead of fully returning the memory to the system and incurring the associated reallocation overhead, the pool holds on to the memory and uses it to satisfy future allocations of similarly-sized blocks. The pool reacts appropriately to out-of-memory conditions as long as all memory allocations are made through it. Allocations performed from outside of the pool may run into spurious out-of-memory conditions due to the pool owning much or all of the available memory. There are two flavors of allocators and memory pools: - :ref:`buf-mempool` - :ref:`svm-mempool` Using :class:`pyopencl.array.Array`\ s can be used with memory pools in a straightforward manner:: mem_pool = pyopencl.tools.MemoryPool(pyopencl.tools.ImmediateAllocator(queue)) a_dev = cl_array.arange(queue, 2000, dtype=np.float32, allocator=mem_pool) Likewise, SVM-based allocators are directly usable with :class:`pyopencl.array.Array`. .. _buf-mempool: :class:`~pyopencl.Buffer`-based Allocators and Memory Pools ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. autoclass:: PooledBuffer .. autoclass:: AllocatorBase .. autoclass:: DeferredAllocator .. autoclass:: ImmediateAllocator .. autoclass:: MemoryPool .. _svm-mempool: :ref:`SVM `-Based Allocators and Memory Pools ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ SVM functionality requires OpenCL 2.0. .. autoclass:: PooledSVM .. autoclass:: SVMAllocator .. autoclass:: SVMPool CL-Object-dependent Caching --------------------------- .. autofunction:: first_arg_dependent_memoize .. autofunction:: clear_first_arg_caches Testing ------- .. autofunction:: pytest_generate_tests_for_pyopencl Argument Types -------------- .. autoclass:: Argument .. autoclass:: DtypedArgument .. autoclass:: VectorArg .. autoclass:: ScalarArg .. autoclass:: OtherArg .. autofunction:: parse_arg_list Device Characterization ----------------------- .. automodule:: pyopencl.characterize :members: Type aliases ------------ .. currentmodule:: pyopencl._cl .. 
class:: AllocatorBase See :class:`pyopencl.tools.AllocatorBase`. """ __copyright__ = "Copyright (C) 2010 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import re from abc import ABC, abstractmethod from sys import intern from typing import Any, List, Optional, Union import numpy as np from pytools import memoize, memoize_method from pytools.persistent_dict import KeyBuilder as KeyBuilderBase from pyopencl._cl import bitlog2, get_cl_header_version # noqa: F401 from pyopencl.compyte.dtypes import ( # noqa: F401 TypeNameNotKnown, dtype_to_ctype, get_or_register_dtype, register_dtype, ) # Do not add a pyopencl import here: This will add an import cycle. def _register_types(): from pyopencl.compyte.dtypes import TYPE_REGISTRY, fill_registry_with_opencl_c_types fill_registry_with_opencl_c_types(TYPE_REGISTRY) get_or_register_dtype("cfloat_t", np.complex64) get_or_register_dtype("cdouble_t", np.complex128) _register_types() # {{{ imported names from pyopencl._cl import ( AllocatorBase, DeferredAllocator, ImmediateAllocator, MemoryPool, PooledBuffer, ) if get_cl_header_version() >= (2, 0): from pyopencl._cl import PooledSVM, SVMAllocator, SVMPool # }}} # {{{ monkeypatch docstrings into imported interfaces _MEMPOOL_IFACE_DOCS = """ .. note:: The current implementation of the memory pool will retain allocated memory after it is returned by the application and keep it in a bin identified by the leading *leading_bits_in_bin_id* bits of the allocation size. To ensure that allocations within each bin are interchangeable, allocation sizes are rounded up to the largest size that shares the leading bits of the requested allocation size. The current default value of *leading_bits_in_bin_id* is four, but this may change in future versions and is not guaranteed. *leading_bits_in_bin_id* must be passed by keyword, and its role is purely advisory. It is not guaranteed that future versions of the pool will use the same allocation scheme and/or honor *leading_bits_in_bin_id*. .. attribute:: held_blocks The number of unused blocks being held by this pool. .. attribute:: active_blocks The number of blocks in active use that have been allocated through this pool. .. attribute:: managed_bytes "Managed" memory is "active" and "held" memory. .. versionadded:: 2021.1.2 .. attribute:: active_bytes "Active" bytes are bytes under the control of the application. This may be smaller than the actual allocated size reflected in :attr:`managed_bytes`. .. versionadded:: 2021.1.2 .. 
method:: free_held Free all unused memory that the pool is currently holding. .. method:: stop_holding Instruct the memory to start immediately freeing memory returned to it, instead of holding it for future allocations. Implicitly calls :meth:`free_held`. This is useful as a cleanup action when a memory pool falls out of use. """ def _monkeypatch_docstrings(): from pytools.codegen import remove_common_indentation PooledBuffer.__doc__ = """ An object representing a :class:`MemoryPool`-based allocation of :class:`~pyopencl.Buffer`-style device memory. Analogous to :class:`~pyopencl.Buffer`, however once this object is deleted, its associated device memory is returned to the pool. Is a :class:`pyopencl.MemoryObject`. """ AllocatorBase.__doc__ = """ An interface implemented by various memory allocation functions in :mod:`pyopencl`. .. automethod:: __call__ Allocate and return a :class:`pyopencl.Buffer` of the given *size*. """ # {{{ DeferredAllocator DeferredAllocator.__doc__ = """ *mem_flags* takes its values from :class:`pyopencl.mem_flags` and corresponds to the *flags* argument of :class:`pyopencl.Buffer`. DeferredAllocator has the same semantics as regular OpenCL buffer allocation, i.e. it may promise memory to be available that may (in any call to a buffer-using CL function) turn out to not exist later on. (Allocations in CL are bound to contexts, not devices, and memory availability depends on which device the buffer is used with.) Implements :class:`AllocatorBase`. .. versionchanged :: 2013.1 ``CLAllocator`` was deprecated and replaced by :class:`DeferredAllocator`. .. method:: __init__(context, mem_flags=pyopencl.mem_flags.READ_WRITE) .. automethod:: __call__ Allocate a :class:`pyopencl.Buffer` of the given *size*. .. versionchanged :: 2020.2 The allocator will succeed even for allocations of size zero, returning *None*. """ # }}} # {{{ ImmediateAllocator ImmediateAllocator.__doc__ = """ *mem_flags* takes its values from :class:`pyopencl.mem_flags` and corresponds to the *flags* argument of :class:`pyopencl.Buffer`. :class:`ImmediateAllocator` will attempt to ensure at allocation time that allocated memory is actually available. If no memory is available, an out-of-memory error is reported at allocation time. Implements :class:`AllocatorBase`. .. versionadded:: 2013.1 .. method:: __init__(queue, mem_flags=pyopencl.mem_flags.READ_WRITE) .. automethod:: __call__ Allocate a :class:`pyopencl.Buffer` of the given *size*. .. versionchanged :: 2020.2 The allocator will succeed even for allocations of size zero, returning *None*. """ # }}} # {{{ MemoryPool MemoryPool.__doc__ = remove_common_indentation(""" A memory pool for OpenCL device memory in :class:`pyopencl.Buffer` form. *allocator* must be an instance of one of the above classes, and should be an :class:`ImmediateAllocator`. The memory pool assumes that allocation failures are reported by the allocator immediately, and not in the OpenCL-typical deferred manner. Implements :class:`AllocatorBase`. .. versionchanged:: 2019.1 Current bin allocation behavior documented, *leading_bits_in_bin_id* added. .. automethod:: __init__ .. automethod:: allocate Return a :class:`PooledBuffer` of the given *size*. .. automethod:: __call__ Synonym for :meth:`allocate` to match :class:`AllocatorBase`. .. 
versionadded:: 2011.2 """) + _MEMPOOL_IFACE_DOCS # }}} _monkeypatch_docstrings() def _monkeypatch_svm_docstrings(): from pytools.codegen import remove_common_indentation # {{{ PooledSVM PooledSVM.__doc__ = ( # pylint: disable=possibly-used-before-assignment """An object representing a :class:`SVMPool`-based allocation of :ref:`svm`. Analogous to :class:`~pyopencl.SVMAllocation`, however once this object is deleted, its associated device memory is returned to the pool from which it came. .. versionadded:: 2022.2 .. note:: If the :class:`SVMAllocator` for the :class:`SVMPool` that allocated an object of this type is associated with an (in-order) :class:`~pyopencl.CommandQueue`, sufficient synchronization is provided to ensure operations enqueued before deallocation complete before operations from a different use (possibly in a different queue) are permitted to start. This applies when :class:`release` is called and also when the object is freed automatically by the garbage collector. Is a :class:`pyopencl.SVMPointer`. Supports structural equality and hashing. .. automethod:: release Return the held memory to the pool. See the note about synchronization behavior during deallocation above. .. automethod:: enqueue_release Synonymous to :meth:`release`, for consistency with :class:`~pyopencl.SVMAllocation`. Note that, unlike :meth:`pyopencl.SVMAllocation.enqueue_release`, specifying a queue or events to be waited for is not supported. .. automethod:: bind_to_queue Analogous to :meth:`pyopencl.SVMAllocation.bind_to_queue`. .. automethod:: unbind_from_queue Analogous to :meth:`pyopencl.SVMAllocation.unbind_from_queue`. """) # }}} # {{{ SVMAllocator SVMAllocator.__doc__ = ( # pylint: disable=possibly-used-before-assignment """ .. versionadded:: 2022.2 .. automethod:: __init__ :arg flags: See :class:`~pyopencl.svm_mem_flags`. :arg queue: If not specified, allocations will be freed eagerly, irrespective of whether pending/enqueued operations are still using the memory. If specified, deallocation of memory will be enqueued with the given queue, and will only be performed after previously-enqueue operations in the queue have completed. It is an error to specify an out-of-order queue. .. warning:: Not specifying a queue will typically lead to undesired behavior, including crashes and memory corruption. See the warning in :ref:`svm`. .. automethod:: __call__ Return a :class:`~pyopencl.SVMAllocation` of the given *size*. """) # }}} # {{{ SVMPool SVMPool.__doc__ = ( # pylint: disable=possibly-used-before-assignment remove_common_indentation(""" A memory pool for OpenCL device memory in :ref:`SVM ` form. *allocator* must be an instance of :class:`SVMAllocator`. .. versionadded:: 2022.2 .. automethod:: __init__ .. automethod:: __call__ Return a :class:`PooledSVM` of the given *size*. """) + _MEMPOOL_IFACE_DOCS) # }}} if get_cl_header_version() >= (2, 0): _monkeypatch_svm_docstrings() # }}} # {{{ first-arg caches _first_arg_dependent_caches = [] def first_arg_dependent_memoize(func): def wrapper(cl_object, *args, **kwargs): """Provides memoization for a function. Typically used to cache things that get created inside a :class:`pyopencl.Context`, e.g. programs and kernels. Assumes that the first argument of the decorated function is an OpenCL object that might go away, such as a :class:`pyopencl.Context` or a :class:`pyopencl.CommandQueue`, and based on which we might want to clear the cache. .. 
versionadded:: 2011.2 """ if kwargs: cache_key = (args, frozenset(kwargs.items())) else: cache_key = (args,) try: ctx_dict = func._pyopencl_first_arg_dep_memoize_dic except AttributeError: # FIXME: This may keep contexts alive longer than desired. # But I guess since the memory in them is freed, who cares. ctx_dict = func._pyopencl_first_arg_dep_memoize_dic = {} _first_arg_dependent_caches.append(ctx_dict) try: return ctx_dict[cl_object][cache_key] except KeyError: arg_dict = ctx_dict.setdefault(cl_object, {}) result = func(cl_object, *args, **kwargs) arg_dict[cache_key] = result return result from functools import update_wrapper update_wrapper(wrapper, func) return wrapper context_dependent_memoize = first_arg_dependent_memoize def first_arg_dependent_memoize_nested(nested_func): """Provides memoization for nested functions. Typically used to cache things that get created inside a :class:`pyopencl.Context`, e.g. programs and kernels. Assumes that the first argument of the decorated function is an OpenCL object that might go away, such as a :class:`pyopencl.Context` or a :class:`pyopencl.CommandQueue`, and will therefore respond to :func:`clear_first_arg_caches`. .. versionadded:: 2013.1 Requires Python 2.5 or newer. """ from functools import wraps cache_dict_name = intern("_memoize_inner_dic_%s_%s_%d" % (nested_func.__name__, nested_func.__code__.co_filename, nested_func.__code__.co_firstlineno)) from inspect import currentframe # prevent ref cycle try: caller_frame = currentframe().f_back cache_context = caller_frame.f_globals[ caller_frame.f_code.co_name] finally: # del caller_frame pass try: cache_dict = getattr(cache_context, cache_dict_name) except AttributeError: cache_dict = {} _first_arg_dependent_caches.append(cache_dict) setattr(cache_context, cache_dict_name, cache_dict) @wraps(nested_func) def new_nested_func(cl_object, *args): try: return cache_dict[cl_object][args] except KeyError: arg_dict = cache_dict.setdefault(cl_object, {}) result = nested_func(cl_object, *args) arg_dict[args] = result return result return new_nested_func def clear_first_arg_caches(): """Empties all first-argument-dependent memoization caches. Also releases all held reference contexts. If it is important to you that the program detaches from its context, you might need to call this function to free all remaining references to your context. .. versionadded:: 2011.2 """ for cache in _first_arg_dependent_caches: cache.clear() import atexit atexit.register(clear_first_arg_caches) # }}} # {{{ pytest fixtures class _ContextFactory: def __init__(self, device): self.device = device def __call__(self): # Get rid of leftovers from past tests. # CL implementations are surprisingly limited in how many # simultaneous contexts they allow... clear_first_arg_caches() from gc import collect collect() import pyopencl as cl return cl.Context([self.device]) def __str__(self): # Don't show address, so that parallel test collection works return (">" % (self.device.name.strip(), self.device.platform.name.strip())) def get_test_platforms_and_devices(plat_dev_string=None): """Parse a string of the form 'PYOPENCL_TEST=0:0,1;intel:i5'. 
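Entries are separated by ``;``; each is a platform identifier, optionally followed by ``:`` and a ``,``-separated list of device identifiers. An identifier is either an integer index or a case-insensitive substring of the object's name/vendor string. Illustrative values::

    0:0,1      (devices 0 and 1 of platform 0)
    intel:i5   (the 'i5' device of the 'intel' platform)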
:return: list of tuples (platform, [device, device, ...]) """ import pyopencl as cl if plat_dev_string is None: import os plat_dev_string = os.environ.get("PYOPENCL_TEST", None) def find_cl_obj(objs, identifier): try: num = int(identifier) except Exception: pass else: return objs[num] found = False for obj in objs: if identifier.lower() in (obj.name + " " + obj.vendor).lower(): return obj if not found: raise RuntimeError("object '%s' not found" % identifier) if plat_dev_string: result = [] for entry in plat_dev_string.split(";"): lhsrhs = entry.split(":") if len(lhsrhs) == 1: platform = find_cl_obj(cl.get_platforms(), lhsrhs[0]) result.append((platform, platform.get_devices())) elif len(lhsrhs) != 2: raise RuntimeError("invalid syntax of PYOPENCL_TEST") else: plat_str, dev_strs = lhsrhs platform = find_cl_obj(cl.get_platforms(), plat_str) devs = platform.get_devices() result.append( (platform, [find_cl_obj(devs, dev_id) for dev_id in dev_strs.split(",")])) return result else: return [ (platform, platform.get_devices()) for platform in cl.get_platforms()] def get_pyopencl_fixture_arg_names(metafunc, extra_arg_names=None): if extra_arg_names is None: extra_arg_names = [] supported_arg_names = [ "platform", "device", "ctx_factory", "ctx_getter", *extra_arg_names ] arg_names = [] for arg in supported_arg_names: if arg not in metafunc.fixturenames: continue if arg == "ctx_getter": from warnings import warn warn( "The 'ctx_getter' arg is deprecated in favor of 'ctx_factory'.", DeprecationWarning, stacklevel=2) arg_names.append(arg) return arg_names def get_pyopencl_fixture_arg_values(): import pyopencl as cl arg_values = [] for platform, devices in get_test_platforms_and_devices(): for device in devices: arg_dict = { "platform": platform, "device": device, "ctx_factory": _ContextFactory(device), "ctx_getter": _ContextFactory(device) } arg_values.append(arg_dict) def idfn(val): if isinstance(val, cl.Platform): # Don't show address, so that parallel test collection works return f"" else: return str(val) return arg_values, idfn def pytest_generate_tests_for_pyopencl(metafunc): """Using the line:: from pyopencl.tools import pytest_generate_tests_for_pyopencl as pytest_generate_tests in your `pytest `__ test scripts allows you to use the arguments *ctx_factory*, *device*, or *platform* in your test functions, and they will automatically be run for each OpenCL device/platform in the system, as appropriate. The following two environment variabls is also supported to control device/platform choice:: PYOPENCL_TEST=0:0,1;intel=i5,i7 """ arg_names = get_pyopencl_fixture_arg_names(metafunc) if not arg_names: return arg_values, ids = get_pyopencl_fixture_arg_values() arg_values = [ tuple(arg_dict[name] for name in arg_names) for arg_dict in arg_values ] metafunc.parametrize(arg_names, arg_values, ids=ids) # }}} # {{{ C argument lists class Argument(ABC): """ .. automethod:: declarator """ @abstractmethod def declarator(self) -> str: pass class DtypedArgument(Argument): """ .. attribute:: name .. attribute:: dtype """ def __init__(self, dtype: Any, name: str) -> None: self.dtype = np.dtype(dtype) self.name = name def __repr__(self) -> str: return "{}({!r}, {})".format( self.__class__.__name__, self.name, self.dtype) def __eq__(self, other: Any) -> bool: return (type(self) is type(other) and self.dtype == other.dtype and self.name == other.name) def __hash__(self) -> int: return ( hash(type(self)) ^ hash(self.dtype) ^ hash(self.name)) class VectorArg(DtypedArgument): """Inherits from :class:`DtypedArgument`. 
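For example (a small sketch)::

    VectorArg(np.float32, "x").declarator()
    # -> "__global float *x"
    VectorArg(np.float32, "x", with_offset=True).declarator()
    # -> "__global float *x__base, long x__offset"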
.. automethod:: __init__ """ def __init__(self, dtype: Any, name: str, with_offset: bool = False) -> None: super().__init__(dtype, name) self.with_offset = with_offset def declarator(self) -> str: if self.with_offset: # Two underscores -> less likelihood of a name clash. return "__global {} *{}__base, long {}__offset".format( dtype_to_ctype(self.dtype), self.name, self.name) else: result = "__global {} *{}".format(dtype_to_ctype(self.dtype), self.name) return result def __eq__(self, other) -> bool: return (super().__eq__(other) and self.with_offset == other.with_offset) def __hash__(self) -> int: return super().__hash__() ^ hash(self.with_offset) class ScalarArg(DtypedArgument): """Inherits from :class:`DtypedArgument`.""" def declarator(self): return "{} {}".format(dtype_to_ctype(self.dtype), self.name) class OtherArg(Argument): def __init__(self, declarator: str, name: str) -> None: self.decl = declarator self.name = name def declarator(self) -> str: return self.decl def __eq__(self, other) -> bool: return (type(self) is type(other) and self.decl == other.decl and self.name == other.name) def __hash__(self) -> int: return ( hash(type(self)) ^ hash(self.decl) ^ hash(self.name)) def parse_c_arg(c_arg: str, with_offset: bool = False) -> DtypedArgument: for aspace in ["__local", "__constant"]: if aspace in c_arg: raise RuntimeError("cannot deal with local or constant " "OpenCL address spaces in C argument lists ") c_arg = c_arg.replace("__global", "") if with_offset: def vec_arg_factory(dtype, name): return VectorArg(dtype, name, with_offset=True) else: vec_arg_factory = VectorArg from pyopencl.compyte.dtypes import parse_c_arg_backend return parse_c_arg_backend(c_arg, ScalarArg, vec_arg_factory) def parse_arg_list( arguments: Union[str, List[str], List[DtypedArgument]], with_offset: bool = False) -> List[DtypedArgument]: """Parse a list of kernel arguments. *arguments* may be a comma-separate list of C declarators in a string, a list of strings representing C declarators, or :class:`Argument` objects. 
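For example (an illustrative sketch)::

    parse_arg_list("__global float *x, int n")
    # -> [VectorArg('x', float32), ScalarArg('n', int32)]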
""" if isinstance(arguments, str): arguments = arguments.split(",") def parse_single_arg(obj: Union[str, DtypedArgument]) -> DtypedArgument: if isinstance(obj, str): from pyopencl.tools import parse_c_arg return parse_c_arg(obj, with_offset=with_offset) else: assert isinstance(obj, DtypedArgument) return obj return [parse_single_arg(arg) for arg in arguments] def get_arg_list_arg_types(arg_types): result = [] for arg_type in arg_types: if isinstance(arg_type, ScalarArg): result.append(arg_type.dtype) elif isinstance(arg_type, VectorArg): result.append(arg_type) else: raise RuntimeError("arg type not understood: %s" % type(arg_type)) return tuple(result) def get_arg_list_scalar_arg_dtypes( arg_types: List[DtypedArgument] ) -> List[Optional[np.dtype]]: result: List[Optional[np.dtype]] = [] for arg_type in arg_types: if isinstance(arg_type, ScalarArg): result.append(arg_type.dtype) elif isinstance(arg_type, VectorArg): result.append(None) if arg_type.with_offset: result.append(np.dtype(np.int64)) else: raise RuntimeError(f"arg type not understood: {type(arg_type)}") return result def get_arg_offset_adjuster_code(arg_types): result = [] for arg_type in arg_types: if isinstance(arg_type, VectorArg) and arg_type.with_offset: result.append("__global %(type)s *%(name)s = " "(__global %(type)s *) " "((__global char *) %(name)s__base + %(name)s__offset);" % { "type": dtype_to_ctype(arg_type.dtype), "name": arg_type.name}) return "\n".join(result) # }}} def get_gl_sharing_context_properties(): import pyopencl as cl ctx_props = cl.context_properties from OpenGL import platform as gl_platform props = [] import sys if sys.platform in ["linux", "linux2"]: from OpenGL import GLX props.append( (ctx_props.GL_CONTEXT_KHR, GLX.glXGetCurrentContext())) props.append( (ctx_props.GLX_DISPLAY_KHR, GLX.glXGetCurrentDisplay())) elif sys.platform == "win32": from OpenGL import WGL props.append( (ctx_props.GL_CONTEXT_KHR, gl_platform.GetCurrentContext())) props.append( (ctx_props.WGL_HDC_KHR, WGL.wglGetCurrentDC())) elif sys.platform == "darwin": props.append( (ctx_props.CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, cl.get_apple_cgl_share_group())) else: raise NotImplementedError("platform '%s' not yet supported" % sys.platform) return props class _CDeclList: def __init__(self, device): self.device = device self.declared_dtypes = set() self.declarations = [] self.saw_double = False self.saw_complex = False def add_dtype(self, dtype): dtype = np.dtype(dtype) if dtype in (np.float64, np.complex128): self.saw_double = True if dtype.kind == "c": self.saw_complex = True if dtype.kind != "V": return if dtype in self.declared_dtypes: return import pyopencl.cltypes if dtype in pyopencl.cltypes.vec_type_to_scalar_and_count: return if hasattr(dtype, "subdtype") and dtype.subdtype is not None: self.add_dtype(dtype.subdtype[0]) return for _name, field_data in sorted(dtype.fields.items()): field_dtype, _offset = field_data[:2] self.add_dtype(field_dtype) _, cdecl = match_dtype_to_c_struct( self.device, dtype_to_ctype(dtype), dtype) self.declarations.append(cdecl) self.declared_dtypes.add(dtype) def visit_arguments(self, arguments): for arg in arguments: dtype = arg.dtype if dtype in (np.float64, np.complex128): self.saw_double = True if dtype.kind == "c": self.saw_complex = True def get_declarations(self): result = "\n\n".join(self.declarations) if self.saw_complex: result = ( "#include \n\n" + result) if self.saw_double: result = ( """ #if __OPENCL_C_VERSION__ < 120 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #define 
PYOPENCL_DEFINE_CDOUBLE """ + result) return result @memoize def match_dtype_to_c_struct(device, name, dtype, context=None): """Return a tuple ``(dtype, c_decl)`` such that the C struct declaration in ``c_decl`` and the structure :class:`numpy.dtype` instance ``dtype`` have the same memory layout. Note that *dtype* may be modified from the value that was passed in, for example to insert padding. (As a remark on implementation, this routine runs a small kernel on the given *device* to ensure that :mod:`numpy` and C offsets and sizes match.) .. versionadded:: 2013.1 This example explains the use of this function:: >>> import numpy as np >>> import pyopencl as cl >>> import pyopencl.tools >>> ctx = cl.create_some_context() >>> dtype = np.dtype([("id", np.uint32), ("value", np.float32)]) >>> dtype, c_decl = pyopencl.tools.match_dtype_to_c_struct( ... ctx.devices[0], 'id_val', dtype) >>> print c_decl typedef struct { unsigned id; float value; } id_val; >>> print dtype [('id', '>> cl.tools.get_or_register_dtype('id_val', dtype) As this example shows, it is important to call :func:`get_or_register_dtype` on the modified ``dtype`` returned by this function, not the original one. """ import pyopencl as cl fields = sorted(dtype.fields.items(), key=lambda name_dtype_offset: name_dtype_offset[1][1]) c_fields = [] for field_name, dtype_and_offset in fields: field_dtype, _offset = dtype_and_offset[:2] if hasattr(field_dtype, "subdtype") and field_dtype.subdtype is not None: array_dtype = field_dtype.subdtype[0] if hasattr(array_dtype, "subdtype") and array_dtype.subdtype is not None: raise NotImplementedError("nested array dtypes are not supported") array_dims = field_dtype.subdtype[1] dims_str = "" try: for dim in array_dims: dims_str += "[%d]" % dim except TypeError: dims_str = "[%d]" % array_dims c_fields.append(" {} {}{};".format( dtype_to_ctype(array_dtype), field_name, dims_str) ) else: c_fields.append( " {} {};".format(dtype_to_ctype(field_dtype), field_name)) c_decl = "typedef struct {{\n{}\n}} {};\n\n".format( "\n".join(c_fields), name) cdl = _CDeclList(device) for _field_name, dtype_and_offset in fields: field_dtype, _offset = dtype_and_offset[:2] cdl.add_dtype(field_dtype) pre_decls = cdl.get_declarations() offset_code = "\n".join( "result[%d] = pycl_offsetof(%s, %s);" % (i+1, name, field_name) for i, (field_name, _) in enumerate(fields)) src = rf""" #define pycl_offsetof(st, m) \ ((uint) ((__local char *) &(dummy.m) \ - (__local char *)&dummy )) {pre_decls} {c_decl} __kernel void get_size_and_offsets(__global uint *result) {{ result[0] = sizeof({name}); __local {name} dummy; {offset_code} }} """ if context is None: context = cl.Context([device]) queue = cl.CommandQueue(context) prg = cl.Program(context, src) knl = prg.build(devices=[device]).get_size_and_offsets import pyopencl.array result_buf = cl.array.empty(queue, 1+len(fields), np.uint32) knl(queue, (1,), (1,), result_buf.data) queue.finish() size_and_offsets = result_buf.get() size = int(size_and_offsets[0]) offsets = size_and_offsets[1:] if any(ofs >= size for ofs in offsets): # offsets not plausible if dtype.itemsize == size: # If sizes match, use numpy's idea of the offsets. offsets = [dtype_and_offset[1] for field_name, dtype_and_offset in fields] else: raise RuntimeError( "OpenCL compiler reported offsetof() past sizeof() " "for struct layout on '%s'. " "This makes no sense, and it's usually indicates a " "compiler bug. " "Refusing to discover struct layout." 
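# (Illustrative: a compiler reporting offsetof() == 24 inside a
# struct whose sizeof() is 16 lands here -- unless numpy's itemsize
# happens to match the reported size, in which case numpy's own
# offsets are substituted above.)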
% device) result_buf.data.release() del knl del prg del queue del context try: dtype_arg_dict = { "names": [field_name for field_name, (field_dtype, offset) in fields], "formats": [field_dtype for field_name, (field_dtype, offset) in fields], "offsets": [int(x) for x in offsets], "itemsize": int(size_and_offsets[0]), } dtype = np.dtype(dtype_arg_dict) if dtype.itemsize != size_and_offsets[0]: # "Old" versions of numpy (1.6.x?) silently ignore "itemsize". Boo. dtype_arg_dict["names"].append("_pycl_size_fixer") dtype_arg_dict["formats"].append(np.uint8) dtype_arg_dict["offsets"].append(int(size_and_offsets[0])-1) dtype = np.dtype(dtype_arg_dict) except NotImplementedError: def calc_field_type(): total_size = 0 padding_count = 0 for offset, (field_name, (field_dtype, _)) in zip(offsets, fields): if offset > total_size: padding_count += 1 yield ("__pycl_padding%d" % padding_count, "V%d" % offset - total_size) yield field_name, field_dtype total_size = field_dtype.itemsize + offset dtype = np.dtype(list(calc_field_type())) assert dtype.itemsize == size_and_offsets[0] return dtype, c_decl @memoize def dtype_to_c_struct(device, dtype): if dtype.fields is None: return "" import pyopencl.cltypes if dtype in pyopencl.cltypes.vec_type_to_scalar_and_count: # Vector types are built-in. Don't try to redeclare those. return "" matched_dtype, c_decl = match_dtype_to_c_struct( device, dtype_to_ctype(dtype), dtype) def dtypes_match(): result = len(dtype.fields) == len(matched_dtype.fields) for name, val in dtype.fields.items(): result = result and matched_dtype.fields[name] == val return result assert dtypes_match() return c_decl # {{{ code generation/templating helper def _process_code_for_macro(code): code = code.replace("//CL//", "\n") if "//" in code: raise RuntimeError("end-of-line comments ('//') may not be used in " "code snippets") return code.replace("\n", " \\\n") class _SimpleTextTemplate: def __init__(self, txt): self.txt = txt def render(self, context): return self.txt class _PrintfTextTemplate: def __init__(self, txt): self.txt = txt def render(self, context): return self.txt % context class _MakoTextTemplate: def __init__(self, txt): from mako.template import Template self.template = Template(txt, strict_undefined=True) def render(self, context): return self.template.render(**context) class _ArgumentPlaceholder: """A placeholder for subclasses of :class:`DtypedArgument`. This is needed because the concrete dtype of the argument is not known at template creation time--it may be a type alias that will only be filled in at run time. These types take the place of these proto-arguments until all types are known. See also :class:`_TemplateRenderer.render_arg`. 
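For instance (an illustrative sketch): the argument entry ``"scan_t *ary"`` first becomes a placeholder with ``typename == "scan_t"``; only when the template is built with ``type_aliases = {"scan_t": np.float64}`` does :meth:`_TemplateRenderer.render_arg` resolve it into a concrete ``VectorArg(np.float64, "ary")``.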
""" def __init__(self, typename, name, **extra_kwargs): self.typename = typename self.name = name self.extra_kwargs = extra_kwargs class _VectorArgPlaceholder(_ArgumentPlaceholder): target_class = VectorArg class _ScalarArgPlaceholder(_ArgumentPlaceholder): target_class = ScalarArg class _TemplateRenderer: def __init__(self, template, type_aliases, var_values, context=None, options=None): self.template = template self.type_aliases = dict(type_aliases) self.var_dict = dict(var_values) for name in self.var_dict: if name.startswith("macro_"): self.var_dict[name] = _process_code_for_macro( self.var_dict[name]) self.context = context self.options = options def __call__(self, txt): if txt is None: return txt result = self.template.get_text_template(txt).render(self.var_dict) return str(result) def get_rendered_kernel(self, txt, kernel_name): import pyopencl as cl prg = cl.Program(self.context, self(txt)).build(self.options) kernel_name_prefix = self.var_dict.get("kernel_name_prefix") if kernel_name_prefix is not None: kernel_name = kernel_name_prefix+kernel_name return getattr(prg, kernel_name) def parse_type(self, typename): if isinstance(typename, str): try: return self.type_aliases[typename] except KeyError: from pyopencl.compyte.dtypes import NAME_TO_DTYPE return NAME_TO_DTYPE[typename] else: return np.dtype(typename) def render_arg(self, arg_placeholder): return arg_placeholder.target_class( self.parse_type(arg_placeholder.typename), arg_placeholder.name, **arg_placeholder.extra_kwargs) _C_COMMENT_FINDER = re.compile(r"/\*.*?\*/") def render_argument_list(self, *arg_lists, **kwargs): with_offset = kwargs.pop("with_offset", False) if kwargs: raise TypeError("unrecognized kwargs: " + ", ".join(kwargs)) all_args = [] for arg_list in arg_lists: if isinstance(arg_list, str): arg_list = str( self.template .get_text_template(arg_list).render(self.var_dict)) arg_list = self._C_COMMENT_FINDER.sub("", arg_list) arg_list = arg_list.replace("\n", " ") all_args.extend(arg_list.split(",")) else: all_args.extend(arg_list) if with_offset: def vec_arg_factory(typename, name): return _VectorArgPlaceholder(typename, name, with_offset=True) else: vec_arg_factory = _VectorArgPlaceholder from pyopencl.compyte.dtypes import parse_c_arg_backend parsed_args = [] for arg in all_args: if isinstance(arg, str): arg = arg.strip() if not arg: continue ph = parse_c_arg_backend(arg, _ScalarArgPlaceholder, vec_arg_factory, name_to_dtype=lambda x: x) parsed_arg = self.render_arg(ph) elif isinstance(arg, Argument): parsed_arg = arg elif isinstance(arg, tuple): parsed_arg = ScalarArg(self.parse_type(arg[0]), arg[1]) else: raise TypeError("unexpected argument type: %s" % type(arg)) parsed_args.append(parsed_arg) return parsed_args def get_type_decl_preamble(self, device, decl_type_names, arguments=None): cdl = _CDeclList(device) for typename in decl_type_names: cdl.add_dtype(self.parse_type(typename)) if arguments is not None: cdl.visit_arguments(arguments) for _, tv in sorted(self.type_aliases.items()): cdl.add_dtype(tv) type_alias_decls = [ "typedef {} {};".format(dtype_to_ctype(val), name) for name, val in sorted(self.type_aliases.items()) ] return cdl.get_declarations() + "\n" + "\n".join(type_alias_decls) class KernelTemplateBase: def __init__(self, template_processor=None): self.template_processor = template_processor self.build_cache = {} _first_arg_dependent_caches.append(self.build_cache) def get_preamble(self): pass _TEMPLATE_PROCESSOR_PATTERN = re.compile(r"^//CL(?::([a-zA-Z0-9_]+))?//") @memoize_method def 
get_text_template(self, txt): proc_match = self._TEMPLATE_PROCESSOR_PATTERN.match(txt) tpl_processor = None if proc_match is not None: tpl_processor = proc_match.group(1) # chop off //CL// mark txt = txt[len(proc_match.group(0)):] if tpl_processor is None: tpl_processor = self.template_processor if tpl_processor is None or tpl_processor == "none": return _SimpleTextTemplate(txt) elif tpl_processor == "printf": return _PrintfTextTemplate(txt) elif tpl_processor == "mako": return _MakoTextTemplate(txt) else: raise RuntimeError( "unknown template processor '%s'" % proc_match.group(1)) def get_renderer(self, type_aliases, var_values, context=None, options=None): return _TemplateRenderer(self, type_aliases, var_values) def build_inner(self, context, *args, **kwargs): raise NotImplementedError def build(self, context, *args, **kwargs): """Provide caching for an :meth:`build_inner`.""" cache_key = (context, args, tuple(sorted(kwargs.items()))) try: return self.build_cache[cache_key] except KeyError: result = self.build_inner(context, *args, **kwargs) self.build_cache[cache_key] = result return result # }}} # {{{ array_module class _CLFakeArrayModule: def __init__(self, queue): self.queue = queue @property def ndarray(self): from pyopencl.array import Array return Array def dot(self, x, y): from pyopencl.array import dot return dot(x, y, queue=self.queue).get() def vdot(self, x, y): from pyopencl.array import vdot return vdot(x, y, queue=self.queue).get() def empty(self, shape, dtype, order="C"): from pyopencl.array import empty return empty(self.queue, shape, dtype, order=order) def hstack(self, arrays): from pyopencl.array import hstack return hstack(arrays, self.queue) def array_module(a): if isinstance(a, np.ndarray): return np else: from pyopencl.array import Array if isinstance(a, Array): return _CLFakeArrayModule(a.queue) else: raise TypeError("array type not understood: %s" % type(a)) # }}} def is_spirv(s): spirv_magic = b"\x07\x23\x02\x03" return ( isinstance(s, bytes) and ( s[:4] == spirv_magic or s[:4] == spirv_magic[::-1])) # {{{ numpy key types builder class _NumpyTypesKeyBuilder(KeyBuilderBase): def update_for_VectorArg(self, key_hash, key): # noqa: N802 self.rec(key_hash, key.dtype) self.update_for_str(key_hash, key.name) self.rec(key_hash, key.with_offset) def update_for_type(self, key_hash, key): if issubclass(key, np.generic): self.update_for_str(key_hash, key.__name__) return raise TypeError("unsupported type for persistent hash keying: %s" % type(key)) # }}} # vim: foldmethod=marker pyopencl-2025.1/pyopencl/version.py0000644000000000000000000000041414332717401014253 0ustar00import re from importlib import metadata VERSION_TEXT = metadata.version("pyopencl") _match = re.match(r"^([0-9.]+)([a-z0-9]*?)$", VERSION_TEXT) assert _match is not None VERSION_STATUS = _match.group(2) VERSION = tuple(int(nr) for nr in _match.group(1).split(".")) pyopencl-2025.1/pyproject.toml0000644000000000000000000001311514332717401013301 0ustar00[build-system] build-backend = "scikit_build_core.build" requires = [ "scikit-build-core >=0.9.3", "nanobind >=1.9.2", # https://numpy.org/doc/stable/dev/depending_on_numpy.html#build-time-dependency # Just depending on numpy will automatically expose the oldest supported ABI. 
# - Retrieved 2024-06-24, AK "numpy", ] [project] name = "pyopencl" version = "2025.1" description = "Python wrapper for OpenCL" readme = "README.rst" authors = [ { name = "Andreas Kloeckner", email = "inform@tiker.net" }, ] requires-python = "~=3.8" classifiers = [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Other Audience", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: C++", "Programming Language :: Python", "Programming Language :: Python :: 3 :: Only", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Scientific/Engineering :: Physics", ] dependencies = [ "importlib-resources; python_version<'3.9'", "numpy", "platformdirs>=2.2", "pytools>=2024.1.5", ] [project.optional-dependencies] oclgrind = [ "oclgrind-binary-distribution>=18.3", ] pocl = [ "pocl-binary-distribution>=1.2", ] test = [ "ruff", "mako", "mypy", "pylint", "pytest>=7", ] [project.urls] Documentation = "https://documen.tician.de/pyopencl" Homepage = "https://mathema.tician.de/software/pyopencl" Repository = "https://github.com/inducer/pyopencl" [tool.scikit-build] sdist.exclude = [ ".mypy_cache", ".ci", ".github", ".conda-ci-build-configure.sh", "doc/upload-docs.sh", ".editorconfig", "TODOs", "run-*.sh", ] [tool.inducer-ci-support] disable-editable-pip-install = true [tool.ruff.lint] preview = true extend-select = [ "B", # flake8-bugbear "C", # flake8-comprehensions "E", # pycodestyle "F", # pyflakes "G", # flake8-logging-format "I", # flake8-isort "N", # pep8-naming "NPY", # numpy "Q", # flake8-quotes "RUF", # ruff "UP", # pyupgrade "W", # pycodestyle ] extend-ignore = [ "E226", # missing whitespace around arithmetic operator "E241", # multiple spaces after comma "E402", # module level import not at the top of file "C90", # McCabe complexity "UP031", # use f-strings instead of % "UP032", # use f-strings instead of .format ] exclude = [ "examples/gl_interop_demo.py", "examples/gl_particle_animation.py", "pyopencl/compyte/**/*.py", ] [tool.ruff.lint.per-file-ignores] "examples/pi-monte-carlo.py" = ["N", "B", "F841"] "examples/black-hole-accretion.py" = ["N", "E501", "B"] "examples/n-body.py" = ["N", "E501"] "pyopencl/__init__.py" = ["I001"] "contrib/fortran-to-opencl/translate.py" = ["N802", "N815", "B"] [tool.ruff.lint.flake8-quotes] inline-quotes = "double" docstring-quotes = "double" multiline-quotes = "double" [tool.ruff.lint.isort] known-first-party = ["pytools", "pymbolic", "cgen"] known-local-folder = ["pyopencl"] lines-after-imports = 2 combine-as-imports = true [tool.pytest.ini_options] markers = [ "bitonic: tests involving bitonic sort" ] [tool.mypy] warn_unused_ignores = true exclude = ["pyopencl/compyte"] [[tool.mypy.overrides]] module = [ "IPython.*", "OpenGL.*", "mako.*", "matplotlib.*", "pyfmmlib.*", "pyopencl._cl.*", "pytest.*", "scipy.*", ] ignore_missing_imports = true [[tool.mypy.overrides]] module = ["pyopencl.compyte.*"] follow_imports = "skip" [tool.cibuildwheel] test-command = "pytest {project}/test" test-extras = [ "test", ] environment-pass = [ "CL_INC_DIR", "CL_LIB_DIR", ] test-skip = [ "*-macosx_*:arm64", "*-macosx_arm64", ] [tool.cibuildwheel.linux] skip = [ "pp*", "cp36-*", "cp37-*", "*_i686", ] test-command = "" before-all = [ "yum install -y git openssl-devel ruby", "bash {package}/scripts/build-ocl.sh", ] before-build = [ "pip install numpy 
-Csetup-args=-Dallow-noblas=true", ] repair-wheel-command = "auditwheel repair -w {dest_dir} --lib-sdir=/.libs {wheel}" [[tool.cibuildwheel.overrides]] select = "*-musllinux*" before-all = [ "apk add ruby git openssl-dev libtool", "bash {package}/scripts/build-ocl.sh", ] repair-wheel-command = "auditwheel repair -w {dest_dir} --lib-sdir=/.libs {wheel}" [tool.cibuildwheel.macos] skip = [ "pp*", "cp36-*", "cp37-*", ] before-all = "bash {package}/scripts/build-ocl-macos.sh" test-command = "pytest {project}/test/test_array.py" # same limitation as conda-forge archs = "x86_64 arm64" # https://github.com/conda-forge/pyopencl-feedstock/blob/6f3c5de59b18c9518abba3cb94f6ae92964553f8/recipe/meta.yaml#L62-L63 [tool.cibuildwheel.macos.environment] # Needed for full C++17 support MACOSX_DEPLOYMENT_TARGET = "10.14" [tool.cibuildwheel.windows] skip = [ "*-win32", "pp*", "cp36-*", "cp37-*", ] test-command = "" before-all = "bash {package}/scripts/build-ocl-windows.sh" [tool.typos.default] extend-ignore-re = [ "(?Rm)^.*(#|//)\\s*spellchecker:\\s*disable-line$" ] [tool.typos.default.extend-words] # for ND Range ND = "ND" nd = "nd" # level-of-detail LOD = "LOD" # short for 'series' "ser" = "ser" # like the numpy function "arange" = "arange" [tool.typos.files] extend-exclude = [ # No thanks, hex IDs in JSON should not be spellchecked. "examples/*.ipynb", # Copied from upstream "pyopencl/cl/pyopencl-random123/*", # This one has comments in French "examples/black-hole-accretion.py" ] pyopencl-2025.1/scripts/build-ocl-macos.sh0000644000000000000000000000163014332717401015361 0ustar00#!/usr/bin/env bash SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) set -o xtrace git clone --branch v2022.01.04 https://github.com/KhronosGroup/OpenCL-ICD-Loader git clone --branch v2022.01.04 https://github.com/KhronosGroup/OpenCL-Headers cmake -D CMAKE_INSTALL_PREFIX=./OpenCL-Headers/install -S ./OpenCL-Headers -B ./OpenCL-Headers/build cmake --build ./OpenCL-Headers/build --target install cmake -D CMAKE_PREFIX_PATH=${PWD}/OpenCL-Headers/install -D OPENCL_ICD_LOADER_HEADERS_DIR=${PWD}/OpenCL-Headers/install/include -D CMAKE_INSTALL_PREFIX=./OpenCL-ICD-Loader/install -S ./OpenCL-ICD-Loader -B ./OpenCL-ICD-Loader/build cmake --build ./OpenCL-ICD-Loader/build --target install --config Release echo "PyOpenCL wheel includes Khronos Group OpenCL-ICD-Loader which is licensed as below" >> ${SCRIPT_DIR}/../LICENSE cat ./OpenCL-ICD-Loader/LICENSE >> ${SCRIPT_DIR}/../LICENSE pyopencl-2025.1/scripts/build-ocl-windows.sh0000644000000000000000000000230314332717401015747 0ustar00#!/usr/bin/env bash SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) set -o xtrace git clone --branch v2022.01.04 https://github.com/KhronosGroup/OpenCL-ICD-Loader git clone --branch v2022.01.04 https://github.com/KhronosGroup/OpenCL-Headers cmake -D CMAKE_INSTALL_PREFIX=./OpenCL-Headers/install -S ./OpenCL-Headers -B ./OpenCL-Headers/build cmake --build ./OpenCL-Headers/build --target install # if someone would like to try to create win32 wheels the below lines may be useful # cmake -D CMAKE_PREFIX_PATH=${PWD}/OpenCL-Headers/install -DOPENCL_ICD_LOADER_HEADERS_DIR=${PWD}/OpenCL-Headers/install/include -S ./OpenCL-ICD-Loader -B ./OpenCL-ICD-Loader/build # cmake --build ./OpenCL-ICD-Loader/build --target install --config Release cmake -D CMAKE_PREFIX_PATH=${PWD}/OpenCL-Headers/install -D OPENCL_ICD_LOADER_HEADERS_DIR=${PWD}/OpenCL-Headers/install/include -S ./OpenCL-ICD-Loader -B ./OpenCL-ICD-Loader/build2 -A 
x64 cmake --build ./OpenCL-ICD-Loader/build2 --target install --config Release echo "PyOpenCL wheel includes Khronos Group OpenCL-ICD-Loader which is licensed as below:" >> ${SCRIPT_DIR}/../LICENSE cat ./OpenCL-ICD-Loader/LICENSE >> ${SCRIPT_DIR}/../LICENSE pyopencl-2025.1/scripts/build-ocl.sh0000644000000000000000000000143514332717401014264 0ustar00#!/usr/bin/env bash SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) set -e -x mkdir -p ~/deps cd ~/deps git clone --branch v2.3.1 https://github.com/OCL-dev/ocl-icd cd ocl-icd curl -L -O https://raw.githubusercontent.com/conda-forge/ocl-icd-feedstock/e2c03e3ddb1ff86630ccf80dc7b87a81640025ea/recipe/install-headers.patch git apply install-headers.patch curl -L -O https://github.com/isuruf/ocl-icd/commit/307f2267100a2d1383f0c4a77344b127c0857588.patch git apply 307f2267100a2d1383f0c4a77344b127c0857588.patch autoreconf -i chmod +x configure ./configure --prefix=/usr make -j4 make install # Bundle license files echo "PyOpenCL wheel includes ocl-icd which is licensed as below" >> ${SCRIPT_DIR}/../LICENSE cat ~/deps/ocl-icd/COPYING >> ${SCRIPT_DIR}/../LICENSEpyopencl-2025.1/src/bitlog.cpp0000644000000000000000000000404314332717401013140 0ustar00// Base-2 logarithm bithack // // Copyright (C) 2009 Andreas Kloeckner // Copyright (C) Sean Eron Anderson (in the public domain) // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #include "bitlog.hpp" const char pyopencl::log_table_8[] = { 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7 }; pyopencl-2025.1/src/bitlog.hpp0000644000000000000000000000417214332717401013150 0ustar00// Base-2 logarithm bithack. 
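// Worked example (illustrative): bitlog2_32(0x12345) computes
// floor(log2(0x12345)). The top half 0x12345 >> 16 == 0x1 is nonzero,
// so the result is 16 + bitlog2_16(0x1); there 0x1 >> 8 == 0, so the
// lookup is log_table_8[1] == 0. Total: 16 -- and indeed
// 2^16 <= 0x12345 < 2^17.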
// // Copyright (C) 2009 Andreas Kloeckner // Copyright (C) Sean Eron Anderson (in the public domain) // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #ifndef _AFJDFJSDFSD_PYOPENCL_HEADER_SEEN_BITLOG_HPP #define _AFJDFJSDFSD_PYOPENCL_HEADER_SEEN_BITLOG_HPP #include <climits> #include <cstdint> namespace pyopencl { /* from http://graphics.stanford.edu/~seander/bithacks.html */ extern const char log_table_8[]; inline unsigned bitlog2_16(uint16_t v) { if (unsigned long t = v >> 8) return 8+log_table_8[t]; else return log_table_8[v]; } inline unsigned bitlog2_32(uint32_t v) { if (uint16_t t = v >> 16) return 16+bitlog2_16(t); else return bitlog2_16(v); } #if defined(UINT64_MAX) inline unsigned bitlog2(uint64_t v) { if (uint32_t t = v >> 32) return 32+bitlog2_32(t); else return bitlog2_32(v); } #else inline unsigned bitlog2(unsigned long v) { #if (ULONG_MAX != 4294967295) if (uint32_t t = v >> 32) return 32+bitlog2_32(t); else #endif return bitlog2_32(v); } #endif } #endif pyopencl-2025.1/src/clinfo_ext.h0000644000000000000000000001067614332717401013464 0ustar00/* Include OpenCL header, and define OpenCL extensions, since what is and is not * available in the official headers is very system-dependent */ #ifndef _EXT_H #define _EXT_H #if (defined(__APPLE__) && !defined(PYOPENCL_APPLE_USE_CL_H)) #include <OpenCL/opencl.h> #else #include <CL/cl.h> #endif /* These two defines were introduced in the 1.2 headers * on 2012-11-30, so earlier versions don't have them * (e.g.
Debian wheezy) */ #ifndef CL_DEVICE_IMAGE_PITCH_ALIGNMENT #define CL_DEVICE_IMAGE_PITCH_ALIGNMENT 0x104A #define CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT 0x104B #endif /* * Extensions */ /* cl_khr_icd */ #define CL_PLATFORM_ICD_SUFFIX_KHR 0x0920 #define CL_PLATFORM_NOT_FOUND_KHR -1001 /* cl_khr_fp64 */ #define CL_DEVICE_DOUBLE_FP_CONFIG 0x1032 /* cl_khr_fp16 */ #define CL_DEVICE_HALF_FP_CONFIG 0x1033 /* cl_khr_terminate_context */ #define CL_DEVICE_TERMINATE_CAPABILITY_KHR 0x200F /* cl_nv_device_attribute_query */ #define CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV 0x4000 #define CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV 0x4001 #define CL_DEVICE_REGISTERS_PER_BLOCK_NV 0x4002 #define CL_DEVICE_WARP_SIZE_NV 0x4003 #define CL_DEVICE_GPU_OVERLAP_NV 0x4004 #define CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV 0x4005 #define CL_DEVICE_INTEGRATED_MEMORY_NV 0x4006 #define CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV 0x4007 #define CL_DEVICE_PCI_BUS_ID_NV 0x4008 #define CL_DEVICE_PCI_SLOT_ID_NV 0x4009 #define CL_DEVICE_PCI_DOMAIN_ID_NV 0x400A /* cl_ext_atomic_counters_{32,64} */ #define CL_DEVICE_MAX_ATOMIC_COUNTERS_EXT 0x4032 /* cl_amd_device_attribute_query */ #define CL_DEVICE_PROFILING_TIMER_OFFSET_AMD 0x4036 #define CL_DEVICE_TOPOLOGY_AMD 0x4037 #define CL_DEVICE_BOARD_NAME_AMD 0x4038 #define CL_DEVICE_GLOBAL_FREE_MEMORY_AMD 0x4039 #define CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD 0x4040 #define CL_DEVICE_SIMD_WIDTH_AMD 0x4041 #define CL_DEVICE_SIMD_INSTRUCTION_WIDTH_AMD 0x4042 #define CL_DEVICE_WAVEFRONT_WIDTH_AMD 0x4043 #define CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD 0x4044 #define CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD 0x4045 #define CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD 0x4046 #define CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD 0x4047 #define CL_DEVICE_LOCAL_MEM_BANKS_AMD 0x4048 #define CL_DEVICE_THREAD_TRACE_SUPPORTED_AMD 0x4049 #define CL_DEVICE_GFXIP_MAJOR_AMD 0x404A #define CL_DEVICE_GFXIP_MINOR_AMD 0x404B #define CL_DEVICE_AVAILABLE_ASYNC_QUEUES_AMD 0x404C #define CL_DEVICE_PREFERRED_WORK_GROUP_SIZE_AMD 0x4030 #define CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD 0x4031 #define CL_DEVICE_PREFERRED_CONSTANT_BUFFER_SIZE_AMD 0x4033 #define CL_DEVICE_PCIE_ID_AMD 0x4034 #ifndef CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD #define CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD 1 typedef union { struct { cl_uint type; cl_uint data[5]; } raw; struct { cl_uint type; cl_char unused[17]; cl_char bus; cl_char device; cl_char function; } pcie; } cl_device_topology_amd; #endif /* cl_amd_offline_devices */ #define CL_CONTEXT_OFFLINE_DEVICES_AMD 0x403F /* cl_ext_device_fission */ #define cl_ext_device_fission 1 typedef cl_ulong cl_device_partition_property_ext; #define CL_DEVICE_PARTITION_EQUALLY_EXT 0x4050 #define CL_DEVICE_PARTITION_BY_COUNTS_EXT 0x4051 #define CL_DEVICE_PARTITION_BY_NAMES_EXT 0x4052 #define CL_DEVICE_PARTITION_BY_NAMES_INTEL 0x4052 /* cl_intel_device_partition_by_names */ #define CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT 0x4053 #define CL_DEVICE_PARENT_DEVICE_EXT 0x4054 #define CL_DEVICE_PARTITION_TYPES_EXT 0x4055 #define CL_DEVICE_AFFINITY_DOMAINS_EXT 0x4056 #define CL_DEVICE_REFERENCE_COUNT_EXT 0x4057 #define CL_DEVICE_PARTITION_STYLE_EXT 0x4058 #define CL_AFFINITY_DOMAIN_L1_CACHE_EXT 0x1 #define CL_AFFINITY_DOMAIN_L2_CACHE_EXT 0x2 #define CL_AFFINITY_DOMAIN_L3_CACHE_EXT 0x3 #define CL_AFFINITY_DOMAIN_L4_CACHE_EXT 0x4 #define CL_AFFINITY_DOMAIN_NUMA_EXT 0x10 #define CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT 0x100 /* cl_intel_advanced_motion_estimation */ #define CL_DEVICE_ME_VERSION_INTEL 0x407E /* cl_qcom_ext_host_ptr */ #define 
CL_DEVICE_EXT_MEM_PADDING_IN_BYTES_QCOM 0x40A0 #define CL_DEVICE_PAGE_SIZE_QCOM 0x40A1 /* cl_khr_spir */ #define CL_DEVICE_SPIR_VERSIONS 0x40E0 /* cl_altera_device_temperature */ #define CL_DEVICE_CORE_TEMPERATURE_ALTERA 0x40F3 /* cl_intel_simultaneous_sharing */ #define CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL 0x4104 #define CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL 0x4105 #endif pyopencl-2025.1/src/mempool.hpp0000644000000000000000000002733214332717401013343 0ustar00// Abstract memory pool implementation // // Copyright (C) 2009-17 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #ifndef _AFJDFJSDFSD_PYGPU_HEADER_SEEN_MEMPOOL_HPP #define _AFJDFJSDFSD_PYGPU_HEADER_SEEN_MEMPOOL_HPP #include #include #include #include #include #include #include "bitlog.hpp" #ifndef PYGPU_PYCUDA #include #include namespace nb = nanobind; #endif namespace PYGPU_PACKAGE { // https://stackoverflow.com/a/44175911 class mp_noncopyable { public: mp_noncopyable() = default; ~mp_noncopyable() = default; private: mp_noncopyable(const mp_noncopyable&) = delete; mp_noncopyable& operator=(const mp_noncopyable&) = delete; }; #ifdef PYGPU_PYCUDA #define PYGPU_SHARED_PTR boost::shared_ptr #else #define PYGPU_SHARED_PTR nb::ref #endif template inline T signed_left_shift(T x, signed shift_amount) { if (shift_amount < 0) return x >> -shift_amount; else return x << shift_amount; } template inline T signed_right_shift(T x, signed shift_amount) { if (shift_amount < 0) return x << -shift_amount; else return x >> shift_amount; } #define always_assert(cond) \ do { \ if (!(cond)) \ throw std::logic_error("mem pool assertion violated: " #cond); \ } while (false); template class memory_pool : mp_noncopyable #ifndef PYGPU_PYCUDA , public nb::intrusive_base #endif { public: typedef typename Allocator::pointer_type pointer_type; typedef typename Allocator::size_type size_type; private: typedef uint32_t bin_nr_t; typedef std::vector bin_t; typedef std::map container_t; container_t m_container; typedef typename container_t::value_type bin_pair_t; PYGPU_SHARED_PTR m_allocator; // A held block is one that's been released by the application, but that // we are keeping around to dish out again. size_type m_held_blocks; // An active block is one that is in use by the application. size_type m_active_blocks; // "Managed" memory is "active" and "held" memory. size_type m_managed_bytes; // "Active" bytes are bytes under the control of the application. 
// This may be smaller than the actual allocated size reflected // in m_managed_bytes. size_type m_active_bytes; bool m_stop_holding; int m_trace; unsigned m_leading_bits_in_bin_id; public: memory_pool(PYGPU_SHARED_PTR alloc, unsigned leading_bits_in_bin_id=4) : m_allocator(alloc), m_held_blocks(0), m_active_blocks(0), m_managed_bytes(0), m_active_bytes(0), m_stop_holding(false), m_trace(false), m_leading_bits_in_bin_id(leading_bits_in_bin_id) { if (m_allocator->is_deferred()) { PyErr_WarnEx(PyExc_UserWarning, "Memory pools expect non-deferred " "semantics from their allocators. You passed a deferred " "allocator, i.e. an allocator whose allocations can turn out to " "be unavailable long after allocation.", 1); } } virtual ~memory_pool() { free_held(); } private: unsigned mantissa_mask() const { return (1 << m_leading_bits_in_bin_id) - 1; } public: bin_nr_t bin_number(size_type size) { signed l = bitlog2(size); size_type shifted = signed_right_shift(size, l-signed(m_leading_bits_in_bin_id)); if (size && (shifted & (1 << m_leading_bits_in_bin_id)) == 0) throw std::runtime_error("memory_pool::bin_number: bitlog2 fault"); size_type chopped = shifted & mantissa_mask(); return l << m_leading_bits_in_bin_id | chopped; } void set_trace(bool flag) { if (flag) ++m_trace; else --m_trace; } size_type alloc_size(bin_nr_t bin) { bin_nr_t exponent = bin >> m_leading_bits_in_bin_id; bin_nr_t mantissa = bin & mantissa_mask(); size_type ones = signed_left_shift((size_type) 1, signed(exponent)-signed(m_leading_bits_in_bin_id) ); if (ones) ones -= 1; size_type head = signed_left_shift( (size_type) ((1<second; } else return it->second; } void inc_held_blocks() { if (m_held_blocks == 0) start_holding_blocks(); ++m_held_blocks; } void dec_held_blocks() { --m_held_blocks; if (m_held_blocks == 0) stop_holding_blocks(); } virtual void start_holding_blocks() { } virtual void stop_holding_blocks() { } public: pointer_type allocate(size_type size) { bin_nr_t bin_nr = bin_number(size); bin_t &bin = get_bin(bin_nr); if (bin.size()) { if (m_trace) std::cout << "[pool] allocation of size " << size << " served from bin " << bin_nr << " which contained " << bin.size() << " entries" << std::endl; return m_allocator->hand_out_existing_block( pop_block_from_bin(bin, size)); } size_type alloc_sz = alloc_size(bin_nr); always_assert(bin_number(alloc_sz) == bin_nr); always_assert(alloc_sz >= size); if (m_trace) std::cout << "[pool] allocation of size " << size << " required new memory" << std::endl; try { return get_from_allocator(alloc_sz, size); } catch (PYGPU_PACKAGE::error &e) { if (!e.is_out_of_memory()) throw; } if (m_trace) std::cout << "[pool] allocation triggered OOM, running GC" << std::endl; m_allocator->try_release_blocks(); if (bin.size()) return m_allocator->hand_out_existing_block( pop_block_from_bin(bin, size)); if (m_trace) std::cout << "[pool] allocation still OOM after GC" << std::endl; while (try_to_free_memory()) { try { return get_from_allocator(alloc_sz, size); } catch (PYGPU_PACKAGE::error &e) { if (!e.is_out_of_memory()) throw; } } throw PYGPU_PACKAGE::error( "memory_pool::allocate", #ifdef PYGPU_PYCUDA CUDA_ERROR_OUT_OF_MEMORY, #endif #ifdef PYGPU_PYOPENCL CL_MEM_OBJECT_ALLOCATION_FAILURE, #endif "failed to free memory for allocation"); } void free(pointer_type &&p, size_type size) { --m_active_blocks; m_active_bytes -= size; bin_nr_t bin_nr = bin_number(size); if (!m_stop_holding) { inc_held_blocks(); get_bin(bin_nr).push_back(std::move(p)); if (m_trace) std::cout << "[pool] block of size " << size << " 
returned to bin " << bin_nr << " which now contains " << get_bin(bin_nr).size() << " entries" << std::endl; } else { m_allocator->free(std::move(p)); m_managed_bytes -= alloc_size(bin_nr); } } void free_held() { for (bin_pair_t &bin_pair: m_container) { bin_t &bin = bin_pair.second; while (bin.size()) { m_allocator->free(std::move(bin.back())); m_managed_bytes -= alloc_size(bin_pair.first); bin.pop_back(); dec_held_blocks(); } } assert(m_held_blocks == 0); } void stop_holding() { m_stop_holding = true; free_held(); } size_type active_blocks() const { return m_active_blocks; } size_type held_blocks() const { return m_held_blocks; } size_type managed_bytes() const { return m_managed_bytes; } size_type active_bytes() const { return m_active_bytes; } bool try_to_free_memory() { // free largest stuff first for (typename container_t::reverse_iterator it = m_container.rbegin(); it != m_container.rend(); ++it) { bin_pair_t &bin_pair = *it; bin_t &bin = bin_pair.second; if (bin.size()) { m_allocator->free(std::move(bin.back())); m_managed_bytes -= alloc_size(bin_pair.first); bin.pop_back(); dec_held_blocks(); return true; } } return false; } private: pointer_type get_from_allocator(size_type alloc_sz, size_type size) { pointer_type result = m_allocator->allocate(alloc_sz); ++m_active_blocks; m_managed_bytes += alloc_sz; m_active_bytes += size; return result; } pointer_type pop_block_from_bin(bin_t &bin, size_type size) { pointer_type result(std::move(bin.back())); bin.pop_back(); dec_held_blocks(); ++m_active_blocks; m_active_bytes += size; return result; } }; template class pooled_allocation : public mp_noncopyable { public: typedef Pool pool_type; typedef typename Pool::pointer_type pointer_type; typedef typename Pool::size_type size_type; protected: PYGPU_SHARED_PTR m_pool; pointer_type m_ptr; size_type m_size; bool m_valid; public: pooled_allocation(PYGPU_SHARED_PTR p, size_type size) : m_pool(p), m_ptr(p->allocate(size)), m_size(size), m_valid(true) { } ~pooled_allocation() { if (m_valid) free(); } void free() { if (m_valid) { m_pool->free(std::move(m_ptr), m_size); m_valid = false; } else throw PYGPU_PACKAGE::error( "pooled_device_allocation::free", #ifdef PYGPU_PYCUDA CUDA_ERROR_INVALID_HANDLE #endif #ifdef PYGPU_PYOPENCL CL_INVALID_VALUE #endif ); } }; } #endif pyopencl-2025.1/src/pyopencl_ext.h0000644000000000000000000000407514332717401014043 0ustar00#ifndef _PYOPENCL_EXT_H #define _PYOPENCL_EXT_H #ifdef PYOPENCL_USE_SHIPPED_EXT #include "clinfo_ext.h" #else #if (defined(__APPLE__) && !defined(PYOPENCL_APPLE_USE_CL_H)) #include #else #include #include #endif #ifndef CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD #define CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD 1 typedef union { struct { cl_uint type; cl_uint data[5]; } raw; struct { cl_uint type; cl_char unused[17]; cl_char bus; cl_char device; cl_char function; } pcie; } cl_device_topology_amd; #endif #ifndef CL_DEVICE_P2P_DEVICES_AMD #define CL_DEVICE_P2P_DEVICES_AMD 0x4089 typedef CL_API_ENTRY cl_int (CL_API_CALL * clEnqueueCopyBufferP2PAMD_fn)(cl_command_queue /*command_queue*/, cl_mem /*src_buffer*/, cl_mem /*dst_buffer*/, size_t /*src_offset*/, size_t /*dst_offset*/, size_t /*cb*/, cl_uint /*num_events_in_wait_list*/, const cl_event* /*event_wait_list*/, cl_event* /*event*/); #endif /* {{{ these NV defines are often missing from the system headers */ #ifndef CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV #define CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV 0x4005 #endif #ifndef CL_DEVICE_INTEGRATED_MEMORY_NV #define CL_DEVICE_INTEGRATED_MEMORY_NV 0x4006 #endif #ifndef 
CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV #define CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV 0x4007 #endif #ifndef CL_DEVICE_PCI_BUS_ID_NV #define CL_DEVICE_PCI_BUS_ID_NV 0x4008 #endif #ifndef CL_DEVICE_PCI_SLOT_ID_NV #define CL_DEVICE_PCI_SLOT_ID_NV 0x4009 #endif #ifndef CL_DEVICE_PCI_DOMAIN_ID_NV #define CL_DEVICE_PCI_DOMAIN_ID_NV 0x400A #endif /* }}} */ #endif #endif /* vim: foldmethod=marker */ pyopencl-2025.1/src/tools.hpp0000644000000000000000000000445414332717401013033 0ustar00// Various odds and ends // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #ifndef _ASDFDAFVVAFF_PYCUDA_HEADER_SEEN_TOOLS_HPP #define _ASDFDAFVVAFF_PYCUDA_HEADER_SEEN_TOOLS_HPP #include #include #include namespace pyopencl { inline npy_intp size_from_dims(int ndim, const npy_intp *dims) { if (ndim != 0) return std::accumulate(dims, dims+ndim, 1, std::multiplies()); else return 1; } inline void run_python_gc() { namespace py = nanobind; py::module_::import_("gc").attr("collect")(); } // https://stackoverflow.com/a/28139075 template struct reversion_wrapper { T& iterable; }; template auto begin (reversion_wrapper w) { return w.iterable.rbegin(); } template auto end (reversion_wrapper w) { return w.iterable.rend(); } template reversion_wrapper reverse (T&& iterable) { return { iterable }; } // https://stackoverflow.com/a/44175911 class noncopyable { public: noncopyable() = default; ~noncopyable() = default; private: noncopyable(const noncopyable&) = delete; noncopyable& operator=(const noncopyable&) = delete; }; } #endif pyopencl-2025.1/src/wrap_cl.cpp0000644000000000000000000000421614332717401013311 0ustar00// PyOpenCL-flavored C++ wrapper of the CL API // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. 
// // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #define PY_ARRAY_UNIQUE_SYMBOL pyopencl_ARRAY_API #include "wrap_cl.hpp" #include using namespace pyopencl; extern void pyopencl_expose_constants(py::module_ &m); extern void pyopencl_expose_part_1(py::module_ &m); extern void pyopencl_expose_part_2(py::module_ &m); extern void pyopencl_expose_mempool(py::module_ &m); static bool import_numpy_helper() { import_array1(false); return true; } NB_MODULE(_cl, m) { py::intrusive_init( [](PyObject *o) noexcept { py::gil_scoped_acquire guard; Py_INCREF(o); }, [](PyObject *o) noexcept { py::gil_scoped_acquire guard; Py_DECREF(o); }); if (!import_numpy_helper()) throw py::python_error(); pyopencl_expose_constants(m); pyopencl_expose_part_1(m); pyopencl_expose_part_2(m); pyopencl_expose_mempool(m); #ifdef NDEBUG // See https://github.com/inducer/pyopencl/issues/758 for context. py::set_leak_warnings(false); #endif } // vim: foldmethod=marker pyopencl-2025.1/src/wrap_cl.hpp0000644000000000000000000050713514332717401013326 0ustar00// PyOpenCL-flavored C++ wrapper of the CL API // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #ifndef _AFJHAYYTA_PYOPENCL_HEADER_SEEN_WRAP_CL_HPP #define _AFJHAYYTA_PYOPENCL_HEADER_SEEN_WRAP_CL_HPP // CL 1.2 undecided: // clSetPrintfCallback // CL 2.0 complete // CL 2.1 complete // CL 2.2 complete // CL 3.0 missing: // clCreateBufferWithProperties // clCreateImageWithProperties // (no wrappers for now: OpenCL 3.0 does not define any optional properties for // buffers or images, no implementations to test with.) 
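// Version-encoding sketch (illustrative, matching get_cl_header_version()
// further below): PYOPENCL_CL_VERSION packs major.minor as 0xMmm0, so
// CL 1.2 is 0x1020 and CL 3.0 is 0x3000, and the (major, minor) tuple is
// recovered as (ver >> 12, (ver >> 4) & 0xff). Defining
// PYOPENCL_PRETEND_CL_VERSION at compile time (e.g. the hypothetical
// invocation -DPYOPENCL_PRETEND_CL_VERSION=0x1010) overrides the
// header-derived value wholesale.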
// {{{ includes #define CL_USE_DEPRECATED_OPENCL_1_1_APIS // #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION #ifdef __APPLE__ // Mac ------------------------------------------------------------------------ #include #include "pyopencl_ext.h" #ifdef HAVE_GL #define PYOPENCL_GL_SHARING_VERSION 1 #include #include #include #endif #else // elsewhere ------------------------------------------------------------------ #define CL_TARGET_OPENCL_VERSION 300 #include #include "pyopencl_ext.h" #if defined(_WIN32) #define NOMINMAX #include #endif #ifdef HAVE_GL #include #include #endif #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) #define PYOPENCL_GL_SHARING_VERSION cl_khr_gl_sharing #endif #endif #include #include #include #include #include #include #include #include #include #include #include #include "wrap_helpers.hpp" #include #include "tools.hpp" #ifdef PYOPENCL_PRETEND_CL_VERSION #define PYOPENCL_CL_VERSION PYOPENCL_PRETEND_CL_VERSION #else #if defined(CL_VERSION_3_0) #define PYOPENCL_CL_VERSION 0x3000 #elif defined(CL_VERSION_2_2) #define PYOPENCL_CL_VERSION 0x2020 #elif defined(CL_VERSION_2_1) #define PYOPENCL_CL_VERSION 0x2010 #elif defined(CL_VERSION_2_0) #define PYOPENCL_CL_VERSION 0x2000 #elif defined(CL_VERSION_1_2) #define PYOPENCL_CL_VERSION 0x1020 #elif defined(CL_VERSION_1_1) #define PYOPENCL_CL_VERSION 0x1010 #else #define PYOPENCL_CL_VERSION 0x1000 #endif #endif #if defined(_WIN32) // MSVC does not understand variable-length arrays #define PYOPENCL_STACK_CONTAINER(TYPE, NAME, COUNT) std::vector NAME(COUNT) #define PYOPENCL_STACK_CONTAINER_GET_PTR(NAME) (NAME.size() ? NAME.data() : nullptr) #else // gcc et al complain about stripping attributes in template arguments #define PYOPENCL_STACK_CONTAINER(TYPE, NAME, COUNT) TYPE NAME[COUNT] #define PYOPENCL_STACK_CONTAINER_GET_PTR(NAME) NAME #endif // }}} // {{{ macros and typedefs for wrappers #if NPY_ABI_VERSION < 0x02000000 #define PyDataType_ELSIZE(descr) ((descr)->elsize) #endif #if PY_VERSION_HEX >= 0x02050000 typedef Py_ssize_t PYOPENCL_BUFFER_SIZE_T; #else typedef int PYOPENCL_BUFFER_SIZE_T; #endif #define PYOPENCL_CAST_BOOL(B) ((B) ? CL_TRUE : CL_FALSE) #define PYOPENCL_DEPRECATED(WHAT, KILL_VERSION, EXTRA_MSG) \ { \ PyErr_Warn( \ PyExc_DeprecationWarning, \ WHAT " is deprecated and will stop working in PyOpenCL " KILL_VERSION". " \ EXTRA_MSG); \ } #if PYOPENCL_CL_VERSION >= 0x1020 #define PYOPENCL_GET_EXT_FUN(PLATFORM, NAME, VAR) \ NAME##_fn VAR \ = (NAME##_fn) \ clGetExtensionFunctionAddressForPlatform(PLATFORM, #NAME); \ \ if (!VAR) \ throw error(#NAME, CL_INVALID_VALUE, #NAME \ "not available"); #else #define PYOPENCL_GET_EXT_FUN(PLATFORM, NAME, VAR) \ NAME##_fn VAR \ = (NAME##_fn) \ clGetExtensionFunctionAddress(#NAME); \ \ if (!VAR) \ throw error(#NAME, CL_INVALID_VALUE, #NAME \ "not available"); #endif #define PYOPENCL_PARSE_PY_DEVICES \ std::vector devices_vec; \ cl_uint num_devices; \ cl_device_id *devices; \ \ if (py_devices.ptr() == Py_None) \ { \ num_devices = 0; \ devices = 0; \ } \ else \ { \ for (py::handle py_dev: py_devices) \ devices_vec.push_back( \ py::cast(py_dev).data()); \ num_devices = devices_vec.size(); \ devices = devices_vec.empty( ) ? nullptr : &devices_vec.front(); \ } \ #define PYOPENCL_RETRY_RETURN_IF_MEM_ERROR(OPERATION) \ try \ { \ OPERATION \ } \ catch (pyopencl::error &e) \ { \ if (!e.is_out_of_memory()) \ throw; \ } \ \ /* If we get here, we got an error from CL. * We should run the Python GC to try and free up * some memory references. 
*/ \ run_python_gc(); \ \ /* Now retry the allocation. If it fails again, * let it fail. */ \ { \ OPERATION \ } #define PYOPENCL_RETRY_IF_MEM_ERROR(OPERATION) \ { \ bool failed_with_mem_error = false; \ try \ { \ OPERATION \ } \ catch (pyopencl::error &e) \ { \ failed_with_mem_error = true; \ if (!e.is_out_of_memory()) \ throw; \ } \ \ if (failed_with_mem_error) \ { \ /* If we get here, we got an error from CL. * We should run the Python GC to try and free up * some memory references. */ \ run_python_gc(); \ \ /* Now retry the allocation. If it fails again, * let it fail. */ \ { \ OPERATION \ } \ } \ } #define PYOPENCL_GET_SVM_SIZE(NAME) \ size_t NAME##_size; \ bool NAME##_has_size = false; \ try \ { \ NAME##_size = NAME.size(); \ NAME##_has_size = true; \ } \ catch (size_not_available) { } // }}} // {{{ tracing and error reporting #ifdef PYOPENCL_TRACE #define PYOPENCL_PRINT_CALL_TRACE(NAME) \ std::cerr << NAME << std::endl; #define PYOPENCL_PRINT_CALL_TRACE_INFO(NAME, EXTRA_INFO) \ std::cerr << NAME << " (" << EXTRA_INFO << ')' << std::endl; #else #define PYOPENCL_PRINT_CALL_TRACE(NAME) /*nothing*/ #define PYOPENCL_PRINT_CALL_TRACE_INFO(NAME, EXTRA_INFO) /*nothing*/ #endif #define PYOPENCL_CALL_GUARDED_THREADED_WITH_TRACE_INFO(NAME, ARGLIST, TRACE_INFO) \ { \ PYOPENCL_PRINT_CALL_TRACE_INFO(#NAME, TRACE_INFO); \ cl_int status_code; \ { \ py::gil_scoped_release release; \ status_code = NAME ARGLIST; \ } \ if (status_code != CL_SUCCESS) \ throw pyopencl::error(#NAME, status_code);\ } #define PYOPENCL_CALL_GUARDED_WITH_TRACE_INFO(NAME, ARGLIST, TRACE_INFO) \ { \ PYOPENCL_PRINT_CALL_TRACE_INFO(#NAME, TRACE_INFO); \ cl_int status_code; \ status_code = NAME ARGLIST; \ if (status_code != CL_SUCCESS) \ throw pyopencl::error(#NAME, status_code);\ } #define PYOPENCL_CALL_GUARDED_THREADED(NAME, ARGLIST) \ { \ PYOPENCL_PRINT_CALL_TRACE(#NAME); \ cl_int status_code; \ { \ py::gil_scoped_release release; \ status_code = NAME ARGLIST; \ } \ if (status_code != CL_SUCCESS) \ throw pyopencl::error(#NAME, status_code);\ } #define PYOPENCL_CALL_GUARDED(NAME, ARGLIST) \ { \ PYOPENCL_PRINT_CALL_TRACE(#NAME); \ cl_int status_code; \ status_code = NAME ARGLIST; \ if (status_code != CL_SUCCESS) \ throw pyopencl::error(#NAME, status_code);\ } #define PYOPENCL_CALL_GUARDED_CLEANUP(NAME, ARGLIST) \ { \ PYOPENCL_PRINT_CALL_TRACE(#NAME); \ cl_int status_code; \ status_code = NAME ARGLIST; \ if (status_code != CL_SUCCESS) \ std::cerr \ << "PyOpenCL WARNING: a clean-up operation failed (dead context maybe?)" \ << std::endl \ << #NAME " failed with code " << status_code \ << std::endl; \ } // }}} // {{{ get_info helpers #define PYOPENCL_GET_OPAQUE_INFO(WHAT, FIRST_ARG, SECOND_ARG, CL_TYPE, TYPE) \ { \ CL_TYPE param_value; \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, sizeof(param_value), ¶m_value, 0)); \ if (param_value) \ return py::object(handle_from_new_ptr( \ new TYPE(param_value, /*retain*/ true))); \ else \ return py::none(); \ } #define PYOPENCL_GET_VEC_INFO(WHAT, FIRST_ARG, SECOND_ARG, RES_VEC) \ { \ size_t size; \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, 0, 0, &size)); \ \ RES_VEC.resize(size / sizeof(RES_VEC.front())); \ \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, size, \ RES_VEC.empty( ) ? 
nullptr : &RES_VEC.front(), &size)); \ } #define PYOPENCL_GET_STR_INFO(WHAT, FIRST_ARG, SECOND_ARG) \ { \ size_t param_value_size; \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, 0, 0, ¶m_value_size)); \ \ std::vector param_value(param_value_size); \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, param_value_size, \ param_value.empty( ) ? nullptr : ¶m_value.front(), ¶m_value_size)); \ \ return py::cast( \ param_value.empty( ) ? "" : std::string(¶m_value.front(), param_value_size-1)); \ } #define PYOPENCL_GET_TYPED_INFO(WHAT, FIRST_ARG, SECOND_ARG, TYPE) \ { \ TYPE param_value; \ PYOPENCL_CALL_GUARDED(clGet##WHAT##Info, \ (FIRST_ARG, SECOND_ARG, sizeof(param_value), ¶m_value, 0)); \ return py::cast(param_value); \ } // }}} // {{{ event helpers -------------------------------------------------------------- #define PYOPENCL_PARSE_WAIT_FOR \ cl_uint num_events_in_wait_list = 0; \ std::vector event_wait_list; \ \ if (py_wait_for.ptr() != Py_None) \ { \ for (py::handle evt: py_wait_for) \ { \ event_wait_list.push_back(py::cast(evt).data()); \ ++num_events_in_wait_list; \ } \ } #define PYOPENCL_WAITLIST_ARGS \ num_events_in_wait_list, (num_events_in_wait_list == 0) ? nullptr : &event_wait_list.front() #define PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, obj) \ try \ { \ return new nanny_event(evt, false, obj); \ } \ catch (...) \ { \ clReleaseEvent(evt); \ throw; \ } #define PYOPENCL_RETURN_NEW_EVENT(evt) \ try \ { \ return new event(evt, false); \ } \ catch (...) \ { \ clReleaseEvent(evt); \ throw; \ } // }}} // {{{ equality testing #define PYOPENCL_EQUALITY_TESTS(cls) \ bool operator==(cls const &other) const \ { return data() == other.data(); } \ bool operator!=(cls const &other) const \ { return data() != other.data(); } \ long hash() const \ { return (long) (intptr_t) data(); } // }}} namespace pyopencl { using namespace py::literals; class program; class command_queue; // {{{ error class error : public std::runtime_error { private: std::string m_routine; cl_int m_code; // This is here because clLinkProgram returns a program // object *just* so that there is somewhere for it to // stuff the linker logs. 
:/ bool m_program_initialized; cl_program m_program; public: error(std::string const &routine, cl_int c, std::string const &msg="") : std::runtime_error(msg), m_routine(routine), m_code(c), m_program_initialized(false), m_program(nullptr) { } error(const char *routine, cl_program prg, cl_int c, const char *msg="") : std::runtime_error(msg), m_routine(routine), m_code(c), m_program_initialized(true), m_program(prg) { } virtual ~error() { if (m_program_initialized) clReleaseProgram(m_program); } const std::string &routine() const { return m_routine; } cl_int code() const { return m_code; } bool is_out_of_memory() const { return (code() == CL_MEM_OBJECT_ALLOCATION_FAILURE || code() == CL_OUT_OF_RESOURCES || code() == CL_OUT_OF_HOST_MEMORY); } program *get_program() const; // FIXME: Inheritance from builtin_exception confuses nanobind const char *err_what() { return what(); } void set_error() const { py::object err_obj = py::cast(*this); py::object errors_mod = py::module_::import_("pyopencl._errors"); if (code() == CL_MEM_OBJECT_ALLOCATION_FAILURE) PyErr_SetObject(errors_mod.attr("MemoryError").ptr(), err_obj.ptr()); else if (code() <= CL_INVALID_VALUE) PyErr_SetObject(errors_mod.attr("LogicError").ptr(), err_obj.ptr()); else if (code() > CL_INVALID_VALUE && code() < CL_SUCCESS) PyErr_SetObject(errors_mod.attr("RuntimeError").ptr(), err_obj.ptr()); else PyErr_SetObject(errors_mod.attr("Error").ptr(), err_obj.ptr()); } }; // }}} // {{{ utility functions inline bool is_queue_out_of_order(cl_command_queue queue) { cl_command_queue_properties param_value; PYOPENCL_CALL_GUARDED(clGetCommandQueueInfo, (queue, CL_QUEUE_PROPERTIES, sizeof(param_value), ¶m_value, 0)); return param_value & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE; } // }}} // {{{ buffer interface helper class py_buffer_wrapper : public noncopyable { private: bool m_initialized; public: Py_buffer m_buf; py_buffer_wrapper() : m_initialized(false) {} void get(PyObject *obj, int flags) { #ifdef PYPY_VERSION // work around https://bitbucket.org/pypy/pypy/issues/2873 if (flags & PyBUF_ANY_CONTIGUOUS) { int flags_wo_cont = flags & ~PyBUF_ANY_CONTIGUOUS; if (PyObject_GetBuffer(obj, &m_buf, flags_wo_cont | PyBUF_C_CONTIGUOUS)) { PyErr_Clear(); if (PyObject_GetBuffer(obj, &m_buf, flags_wo_cont | PyBUF_F_CONTIGUOUS)) throw py::python_error(); } } else #endif if (PyObject_GetBuffer(obj, &m_buf, flags)) throw py::python_error(); m_initialized = true; } virtual ~py_buffer_wrapper() { if (m_initialized) PyBuffer_Release(&m_buf); } }; // }}} inline py::tuple get_cl_header_version() { return py::make_tuple( PYOPENCL_CL_VERSION >> (3*4), (PYOPENCL_CL_VERSION >> (1*4)) & 0xff ); } // {{{ platform class platform : noncopyable { private: cl_platform_id m_platform; public: platform(cl_platform_id pid) : m_platform(pid) { } platform(cl_platform_id pid, bool /*retain (ignored)*/) : m_platform(pid) { } cl_platform_id data() const { return m_platform; } PYOPENCL_EQUALITY_TESTS(platform); py::object get_info(cl_platform_info param_name) const { switch (param_name) { case CL_PLATFORM_PROFILE: case CL_PLATFORM_VERSION: case CL_PLATFORM_NAME: case CL_PLATFORM_VENDOR: #if !(defined(CL_PLATFORM_NVIDIA) && CL_PLATFORM_NVIDIA == 0x3001) case CL_PLATFORM_EXTENSIONS: #endif PYOPENCL_GET_STR_INFO(Platform, m_platform, param_name); #if PYOPENCL_CL_VERSION >= 0x2010 case CL_PLATFORM_HOST_TIMER_RESOLUTION: PYOPENCL_GET_TYPED_INFO(Platform, m_platform, param_name, cl_ulong); #endif #if PYOPENCL_CL_VERSION >= 0x3000 case CL_PLATFORM_NUMERIC_VERSION: 
PYOPENCL_GET_TYPED_INFO(Platform, m_platform, param_name, cl_version); case CL_PLATFORM_EXTENSIONS_WITH_VERSION: { std::vector result; PYOPENCL_GET_VEC_INFO(Platform, m_platform, param_name, result); PYOPENCL_RETURN_VECTOR(cl_name_version, result); } #endif default: throw error("Platform.get_info", CL_INVALID_VALUE); } } py::list get_devices(cl_device_type devtype); }; inline py::list get_platforms() { cl_uint num_platforms = 0; PYOPENCL_CALL_GUARDED(clGetPlatformIDs, (0, 0, &num_platforms)); std::vector platforms(num_platforms); PYOPENCL_CALL_GUARDED(clGetPlatformIDs, (num_platforms, platforms.empty( ) ? nullptr : &platforms.front(), &num_platforms)); py::list result; for (cl_platform_id pid: platforms) result.append(handle_from_new_ptr( new platform(pid))); return result; } // }}} // {{{ device class device : noncopyable { public: enum reference_type_t { REF_NOT_OWNABLE, #if PYOPENCL_CL_VERSION >= 0x1020 REF_CL_1_2, #endif }; private: cl_device_id m_device; reference_type_t m_ref_type; public: device(cl_device_id did) : m_device(did), m_ref_type(REF_NOT_OWNABLE) { } device(cl_device_id did, bool retain, reference_type_t ref_type=REF_NOT_OWNABLE) : m_device(did), m_ref_type(ref_type) { if (retain && ref_type != REF_NOT_OWNABLE) { if (false) { } #if PYOPENCL_CL_VERSION >= 0x1020 else if (ref_type == REF_CL_1_2) { PYOPENCL_CALL_GUARDED(clRetainDevice, (did)); } #endif else throw error("Device", CL_INVALID_VALUE, "cannot own references to devices when device fission or CL 1.2 is not available"); } } ~device() { #if PYOPENCL_CL_VERSION >= 0x1020 if (m_ref_type == REF_CL_1_2) PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseDevice, (m_device)); #endif } cl_device_id data() const { return m_device; } PYOPENCL_EQUALITY_TESTS(device); py::object get_info(cl_device_info param_name) const { #define DEV_GET_INT_INF(TYPE) \ PYOPENCL_GET_TYPED_INFO(Device, m_device, param_name, TYPE); switch (param_name) { case CL_DEVICE_TYPE: DEV_GET_INT_INF(cl_device_type); case CL_DEVICE_VENDOR_ID: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_COMPUTE_UNITS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_WORK_GROUP_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_MAX_WORK_ITEM_SIZES: { std::vector result; PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(size_t, result); } case CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_CLOCK_FREQUENCY: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_ADDRESS_BITS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_READ_IMAGE_ARGS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_WRITE_IMAGE_ARGS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_MEM_ALLOC_SIZE: DEV_GET_INT_INF(cl_ulong); case CL_DEVICE_IMAGE2D_MAX_WIDTH: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE2D_MAX_HEIGHT: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE3D_MAX_WIDTH: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE3D_MAX_HEIGHT: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE3D_MAX_DEPTH: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE_SUPPORT: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_MAX_PARAMETER_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_MAX_SAMPLERS: 
DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MEM_BASE_ADDR_ALIGN: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_SINGLE_FP_CONFIG: DEV_GET_INT_INF(cl_device_fp_config); #ifdef CL_DEVICE_DOUBLE_FP_CONFIG case CL_DEVICE_DOUBLE_FP_CONFIG: DEV_GET_INT_INF(cl_device_fp_config); #endif #ifdef CL_DEVICE_HALF_FP_CONFIG case CL_DEVICE_HALF_FP_CONFIG: DEV_GET_INT_INF(cl_device_fp_config); #endif case CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: DEV_GET_INT_INF(cl_device_mem_cache_type); case CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: DEV_GET_INT_INF(cl_ulong); case CL_DEVICE_GLOBAL_MEM_SIZE: DEV_GET_INT_INF(cl_ulong); case CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: DEV_GET_INT_INF(cl_ulong); case CL_DEVICE_MAX_CONSTANT_ARGS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_LOCAL_MEM_TYPE: DEV_GET_INT_INF(cl_device_local_mem_type); case CL_DEVICE_LOCAL_MEM_SIZE: DEV_GET_INT_INF(cl_ulong); case CL_DEVICE_ERROR_CORRECTION_SUPPORT: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_PROFILING_TIMER_RESOLUTION: DEV_GET_INT_INF(size_t); case CL_DEVICE_ENDIAN_LITTLE: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_AVAILABLE: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_COMPILER_AVAILABLE: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_EXECUTION_CAPABILITIES: DEV_GET_INT_INF(cl_device_exec_capabilities); #if PYOPENCL_CL_VERSION >= 0x2000 case CL_DEVICE_QUEUE_ON_HOST_PROPERTIES: DEV_GET_INT_INF(cl_command_queue_properties); #else case CL_DEVICE_QUEUE_PROPERTIES: DEV_GET_INT_INF(cl_command_queue_properties); #endif case CL_DEVICE_NAME: case CL_DEVICE_VENDOR: case CL_DRIVER_VERSION: case CL_DEVICE_PROFILE: case CL_DEVICE_VERSION: case CL_DEVICE_EXTENSIONS: PYOPENCL_GET_STR_INFO(Device, m_device, param_name); case CL_DEVICE_PLATFORM: PYOPENCL_GET_OPAQUE_INFO(Device, m_device, param_name, cl_platform_id, platform); #if PYOPENCL_CL_VERSION >= 0x1010 case CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_HOST_UNIFIED_MEMORY: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_OPENCL_C_VERSION: PYOPENCL_GET_STR_INFO(Device, m_device, param_name); #endif #ifdef CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV case CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV: case CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV: case CL_DEVICE_REGISTERS_PER_BLOCK_NV: case CL_DEVICE_WARP_SIZE_NV: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_GPU_OVERLAP_NV: case CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: case CL_DEVICE_INTEGRATED_MEMORY_NV: DEV_GET_INT_INF(cl_bool); #endif #ifdef CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV case CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_PCI_BUS_ID_NV case CL_DEVICE_PCI_BUS_ID_NV: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_PCI_SLOT_ID_NV case CL_DEVICE_PCI_SLOT_ID_NV: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_PCI_DOMAIN_ID_NV case CL_DEVICE_PCI_DOMAIN_ID_NV: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_THREAD_TRACE_SUPPORTED_AMD case CL_DEVICE_THREAD_TRACE_SUPPORTED_AMD: DEV_GET_INT_INF(cl_bool); #endif #ifdef CL_DEVICE_GFXIP_MAJOR_AMD 
case CL_DEVICE_GFXIP_MAJOR_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_GFXIP_MINOR_AMD case CL_DEVICE_GFXIP_MINOR_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_AVAILABLE_ASYNC_QUEUES_AMD case CL_DEVICE_AVAILABLE_ASYNC_QUEUES_AMD: DEV_GET_INT_INF(cl_uint); #endif #if PYOPENCL_CL_VERSION >= 0x1020 case CL_DEVICE_LINKER_AVAILABLE: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_BUILT_IN_KERNELS: PYOPENCL_GET_STR_INFO(Device, m_device, param_name); case CL_DEVICE_IMAGE_MAX_BUFFER_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_IMAGE_MAX_ARRAY_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_PARENT_DEVICE: PYOPENCL_GET_OPAQUE_INFO(Device, m_device, param_name, cl_device_id, device); case CL_DEVICE_PARTITION_MAX_SUB_DEVICES: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PARTITION_TYPE: case CL_DEVICE_PARTITION_PROPERTIES: { std::vector result; PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(cl_device_partition_property, result); } case CL_DEVICE_PARTITION_AFFINITY_DOMAIN: { #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic push // what's being ignored here is an alignment attribute to native size, which // shouldn't matter on the relevant ABIs that I'm aware of. #pragma GCC diagnostic ignored "-Wignored-attributes" #endif std::vector result; #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic pop #endif PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(cl_device_affinity_domain, result); } case CL_DEVICE_REFERENCE_COUNT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_INTEROP_USER_SYNC: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_PRINTF_BUFFER_SIZE: DEV_GET_INT_INF(cl_bool); #endif // {{{ AMD dev attrs cl_amd_device_attribute_query // // types of AMD dev attrs divined from // https://github.com/KhronosGroup/OpenCL-CLHPP/blob/3b03738fef487378b188d21cc5f2bae276aa8721/include/CL/opencl.hpp#L1471-L1500 #ifdef CL_DEVICE_PROFILING_TIMER_OFFSET_AMD case CL_DEVICE_PROFILING_TIMER_OFFSET_AMD: DEV_GET_INT_INF(cl_ulong); #endif #ifdef CL_DEVICE_TOPOLOGY_AMD case CL_DEVICE_TOPOLOGY_AMD: PYOPENCL_GET_TYPED_INFO( Device, m_device, param_name, cl_device_topology_amd); #endif #ifdef CL_DEVICE_BOARD_NAME_AMD case CL_DEVICE_BOARD_NAME_AMD: ; PYOPENCL_GET_STR_INFO(Device, m_device, param_name); #endif #ifdef CL_DEVICE_GLOBAL_FREE_MEMORY_AMD case CL_DEVICE_GLOBAL_FREE_MEMORY_AMD: { std::vector result; PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(size_t, result); } #endif #ifdef CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD case CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD case CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD case CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD case CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD case CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_LOCAL_MEM_BANKS_AMD case CL_DEVICE_LOCAL_MEM_BANKS_AMD: DEV_GET_INT_INF(cl_uint); #endif // FIXME: MISSING: // // CL_DEVICE_THREAD_TRACE_SUPPORTED_AMD // CL_DEVICE_GFXIP_MAJOR_AMD // CL_DEVICE_GFXIP_MINOR_AMD // CL_DEVICE_AVAILABLE_ASYNC_QUEUES_AMD // CL_DEVICE_PREFERRED_WORK_GROUP_SIZE_AMD // CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD // 
CL_DEVICE_PREFERRED_CONSTANT_BUFFER_SIZE_AMD // CL_DEVICE_PCIE_ID_AMD // }}} #ifdef CL_DEVICE_MAX_ATOMIC_COUNTERS_EXT case CL_DEVICE_MAX_ATOMIC_COUNTERS_EXT: DEV_GET_INT_INF(cl_uint); #endif #if PYOPENCL_CL_VERSION >= 0x2000 case CL_DEVICE_MAX_READ_WRITE_IMAGE_ARGS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_QUEUE_ON_DEVICE_PROPERTIES: DEV_GET_INT_INF(cl_command_queue_properties); case CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_ON_DEVICE_QUEUES: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_MAX_ON_DEVICE_EVENTS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_SVM_CAPABILITIES: DEV_GET_INT_INF(cl_device_svm_capabilities); case CL_DEVICE_GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE: DEV_GET_INT_INF(size_t); case CL_DEVICE_MAX_PIPE_ARGS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PIPE_MAX_ACTIVE_RESERVATIONS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PIPE_MAX_PACKET_SIZE: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_PLATFORM_ATOMIC_ALIGNMENT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_GLOBAL_ATOMIC_ALIGNMENT: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_PREFERRED_LOCAL_ATOMIC_ALIGNMENT: DEV_GET_INT_INF(cl_uint); #endif #if PYOPENCL_CL_VERSION >= 0x2010 case CL_DEVICE_IL_VERSION: PYOPENCL_GET_STR_INFO(Device, m_device, param_name); case CL_DEVICE_MAX_NUM_SUB_GROUPS: DEV_GET_INT_INF(cl_uint); case CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS: DEV_GET_INT_INF(cl_bool); #endif #if PYOPENCL_CL_VERSION >= 0x3000 case CL_DEVICE_NUMERIC_VERSION: DEV_GET_INT_INF(cl_version); case CL_DEVICE_EXTENSIONS_WITH_VERSION: case CL_DEVICE_ILS_WITH_VERSION: case CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION: case CL_DEVICE_OPENCL_C_ALL_VERSIONS: case CL_DEVICE_OPENCL_C_FEATURES: { std::vector result; PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(cl_name_version, result); } case CL_DEVICE_ATOMIC_MEMORY_CAPABILITIES: DEV_GET_INT_INF(cl_device_atomic_capabilities); case CL_DEVICE_ATOMIC_FENCE_CAPABILITIES: DEV_GET_INT_INF(cl_device_atomic_capabilities); case CL_DEVICE_NON_UNIFORM_WORK_GROUP_SUPPORT: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: DEV_GET_INT_INF(size_t); case CL_DEVICE_WORK_GROUP_COLLECTIVE_FUNCTIONS_SUPPORT: DEV_GET_INT_INF(cl_bool); case CL_DEVICE_GENERIC_ADDRESS_SPACE_SUPPORT: DEV_GET_INT_INF(cl_bool); #ifdef CL_DEVICE_DEVICE_ENQUEUE_SUPPORT case CL_DEVICE_DEVICE_ENQUEUE_SUPPORT: DEV_GET_INT_INF(cl_bool); #endif #ifdef CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES case CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES: DEV_GET_INT_INF(cl_device_device_enqueue_capabilities); #endif case CL_DEVICE_PIPE_SUPPORT: DEV_GET_INT_INF(cl_bool); #endif #ifdef CL_DEVICE_ME_VERSION_INTEL case CL_DEVICE_ME_VERSION_INTEL: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_EXT_MEM_PADDING_IN_BYTES_QCOM case CL_DEVICE_EXT_MEM_PADDING_IN_BYTES_QCOM: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_PAGE_SIZE_QCOM case CL_DEVICE_PAGE_SIZE_QCOM: DEV_GET_INT_INF(cl_uint); #endif #ifdef CL_DEVICE_SPIR_VERSIONS case CL_DEVICE_SPIR_VERSIONS: PYOPENCL_GET_STR_INFO(Device, m_device, param_name); #endif #ifdef CL_DEVICE_CORE_TEMPERATURE_ALTERA case CL_DEVICE_CORE_TEMPERATURE_ALTERA: DEV_GET_INT_INF(cl_int); #endif #ifdef CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL case CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL: { std::vector result; PYOPENCL_GET_VEC_INFO(Device, m_device, param_name, result); PYOPENCL_RETURN_VECTOR(cl_uint, 
result); } #endif #ifdef CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL case CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL: DEV_GET_INT_INF(cl_uint); #endif default: throw error("Device.get_info", CL_INVALID_VALUE); } } #if PYOPENCL_CL_VERSION >= 0x1020 py::list create_sub_devices(py::object py_properties) { std::vector properties; COPY_PY_LIST(cl_device_partition_property, properties); properties.push_back(0); cl_device_partition_property *props_ptr = properties.empty( ) ? nullptr : &properties.front(); cl_uint num_entries; PYOPENCL_CALL_GUARDED(clCreateSubDevices, (m_device, props_ptr, 0, nullptr, &num_entries)); std::vector result; result.resize(num_entries); PYOPENCL_CALL_GUARDED(clCreateSubDevices, (m_device, props_ptr, num_entries, &result.front(), nullptr)); py::list py_result; for (cl_device_id did: result) py_result.append(handle_from_new_ptr( new pyopencl::device(did, /*retain*/true, device::REF_CL_1_2))); return py_result; } #endif #if PYOPENCL_CL_VERSION >= 0x2010 py::tuple device_and_host_timer() const { cl_ulong device_timestamp, host_timestamp; PYOPENCL_CALL_GUARDED(clGetDeviceAndHostTimer, (m_device, &device_timestamp, &host_timestamp)); return py::make_tuple(device_timestamp, host_timestamp); } cl_ulong host_timer() const { cl_ulong host_timestamp; PYOPENCL_CALL_GUARDED(clGetHostTimer, (m_device, &host_timestamp)); return host_timestamp; } #endif }; inline py::list platform::get_devices(cl_device_type devtype) { cl_uint num_devices = 0; PYOPENCL_PRINT_CALL_TRACE("clGetDeviceIDs"); { cl_int status_code; status_code = clGetDeviceIDs(m_platform, devtype, 0, 0, &num_devices); if (status_code == CL_DEVICE_NOT_FOUND) num_devices = 0; else if (status_code != CL_SUCCESS) \ throw pyopencl::error("clGetDeviceIDs", status_code); } if (num_devices == 0) return py::list(); std::vector devices(num_devices); PYOPENCL_CALL_GUARDED(clGetDeviceIDs, (m_platform, devtype, num_devices, devices.empty( ) ? 
nullptr : &devices.front(), &num_devices)); py::list result; for (cl_device_id did: devices) result.append(handle_from_new_ptr( new device(did))); return result; } // }}} // {{{ context class context : public noncopyable, public py::intrusive_base { private: cl_context m_context; public: context(cl_context ctx, bool retain) : m_context(ctx) { if (retain) PYOPENCL_CALL_GUARDED(clRetainContext, (ctx)); } ~context() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseContext, (m_context)); } cl_context data() const { return m_context; } PYOPENCL_EQUALITY_TESTS(context); py::object get_info(cl_context_info param_name) const { switch (param_name) { case CL_CONTEXT_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO( Context, m_context, param_name, cl_uint); case CL_CONTEXT_DEVICES: { std::vector result; PYOPENCL_GET_VEC_INFO(Context, m_context, param_name, result); py::list py_result; for (cl_device_id did: result) py_result.append(handle_from_new_ptr( new pyopencl::device(did))); return py_result; } case CL_CONTEXT_PROPERTIES: { std::vector result; PYOPENCL_GET_VEC_INFO(Context, m_context, param_name, result); py::list py_result; for (size_t i = 0; i < result.size(); i+=2) { cl_context_properties key = result[i]; py::object value; switch (key) { case CL_CONTEXT_PLATFORM: { value = py::object( handle_from_new_ptr(new platform( reinterpret_cast(result[i+1])))); break; } #if defined(PYOPENCL_GL_SHARING_VERSION) && (PYOPENCL_GL_SHARING_VERSION >= 1) #if defined(__APPLE__) && defined(HAVE_GL) case CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE: #else case CL_GL_CONTEXT_KHR: case CL_EGL_DISPLAY_KHR: case CL_GLX_DISPLAY_KHR: case CL_WGL_HDC_KHR: case CL_CGL_SHAREGROUP_KHR: #endif value = py::cast(result[i+1]); break; #endif case 0: break; default: throw error("Context.get_info", CL_INVALID_VALUE, "unknown context_property key encountered"); } py_result.append(py::make_tuple(result[i], value)); } return py_result; } #if PYOPENCL_CL_VERSION >= 0x1010 case CL_CONTEXT_NUM_DEVICES: PYOPENCL_GET_TYPED_INFO( Context, m_context, param_name, cl_uint); #endif default: throw error("Context.get_info", CL_INVALID_VALUE); } } // not exposed to python int get_hex_platform_version() const { std::vector devices; PYOPENCL_GET_VEC_INFO(Context, m_context, CL_CONTEXT_DEVICES, devices); if (devices.size() == 0) throw error("Context._get_hex_version", CL_INVALID_VALUE, "platform has no devices"); cl_platform_id plat; PYOPENCL_CALL_GUARDED(clGetDeviceInfo, (devices[0], CL_DEVICE_PLATFORM, sizeof(plat), &plat, nullptr)); std::string plat_version; { size_t param_value_size; PYOPENCL_CALL_GUARDED(clGetPlatformInfo, (plat, CL_PLATFORM_VERSION, 0, 0, ¶m_value_size)); std::vector param_value(param_value_size); PYOPENCL_CALL_GUARDED(clGetPlatformInfo, (plat, CL_PLATFORM_VERSION, param_value_size, param_value.empty( ) ? nullptr : ¶m_value.front(), ¶m_value_size)); plat_version = param_value.empty( ) ? 
"" : std::string(¶m_value.front(), param_value_size-1); } int major_ver, minor_ver; errno = 0; int match_count = sscanf(plat_version.c_str(), "OpenCL %d.%d ", &major_ver, &minor_ver); if (errno || match_count != 2) throw error("Context._get_hex_platform_version", CL_INVALID_VALUE, "Platform version string did not have expected format"); return major_ver << 12 | minor_ver << 4; } #if PYOPENCL_CL_VERSION >= 0x2010 void set_default_device_command_queue(device const &dev, command_queue const &queue); #endif }; inline std::vector parse_context_properties( py::object py_properties) { std::vector props; if (py_properties.ptr() != Py_None) { for (py::handle prop_tuple_py: py_properties) { py::tuple prop_tuple(py::cast(prop_tuple_py)); if (len(prop_tuple) != 2) throw error("Context", CL_INVALID_VALUE, "property tuple must have length 2"); cl_context_properties prop = py::cast(prop_tuple[0]); props.push_back(prop); if (prop == CL_CONTEXT_PLATFORM) { props.push_back( reinterpret_cast( py::cast(prop_tuple[1]).data())); } #if defined(PYOPENCL_GL_SHARING_VERSION) && (PYOPENCL_GL_SHARING_VERSION >= 1) #if defined(_WIN32) else if (prop == CL_WGL_HDC_KHR) { // size_t is a stand-in for HANDLE, hopefully has the same size. size_t hnd = py::cast(prop_tuple[1]); props.push_back(hnd); } #endif else if ( #if defined(__APPLE__) && defined(HAVE_GL) prop == CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE #else prop == CL_GL_CONTEXT_KHR || prop == CL_EGL_DISPLAY_KHR || prop == CL_GLX_DISPLAY_KHR || prop == CL_CGL_SHAREGROUP_KHR #endif ) { py::object ctypes = py::module_::import_("ctypes"); py::object prop = prop_tuple[1], c_void_p = ctypes.attr("c_void_p"); py::object ptr = ctypes.attr("cast")(prop, c_void_p); props.push_back(py::cast(ptr.attr("value"))); } #endif else throw error("Context", CL_INVALID_VALUE, "invalid context property"); } props.push_back(0); } return props; } inline void create_context_inner(context *self, py::object py_devices, py::object py_properties, py::object py_dev_type) { std::vector props = parse_context_properties(py_properties); cl_context_properties *props_ptr = props.empty( ) ? nullptr : &props.front(); cl_int status_code; cl_context ctx; // from device list if (py_devices.ptr() != Py_None) { if (py_dev_type.ptr() != Py_None) throw error("Context", CL_INVALID_VALUE, "one of 'devices' or 'dev_type' must be None"); std::vector devices; for (py::handle py_dev: py_devices) devices.push_back(py::cast(py_dev).data()); PYOPENCL_PRINT_CALL_TRACE("clCreateContext"); ctx = clCreateContext( props_ptr, devices.size(), devices.empty( ) ? nullptr : &devices.front(), 0, 0, &status_code); } // from dev_type else { cl_device_type dev_type = CL_DEVICE_TYPE_DEFAULT; if (py_dev_type.ptr() != Py_None) dev_type = py::cast(py_dev_type); PYOPENCL_PRINT_CALL_TRACE("clCreateContextFromType"); ctx = clCreateContextFromType(props_ptr, dev_type, 0, 0, &status_code); } if (status_code != CL_SUCCESS) throw pyopencl::error("Context", status_code); try { new (self) context(ctx, false); } catch (...) { PYOPENCL_CALL_GUARDED(clReleaseContext, (ctx)); throw; } } // }}} // {{{ command_queue class command_queue: public py::intrusive_base { private: cl_command_queue m_queue; // m_finalized==True indicates that this command queue should no longer // be used. An example of this is if a command queue is used as a context // manager, after the 'with' block exits. // // This mechanism is not foolproof, as it is perfectly possible to create // other Python proxy objects referring to the same underlying // cl_command_queue. 
Even so, this ought to flag a class of potentially // very damaging synchronization bugs. bool m_finalized; public: command_queue(cl_command_queue q, bool retain) : m_queue(q), m_finalized(false) { if (retain) PYOPENCL_CALL_GUARDED(clRetainCommandQueue, (q)); } command_queue(command_queue const &src) : m_queue(src.m_queue), m_finalized(false) { PYOPENCL_CALL_GUARDED(clRetainCommandQueue, (m_queue)); } command_queue( const context &ctx, const device *py_dev=nullptr, py::object py_props=py::none()) : m_finalized(false) { cl_device_id dev; if (py_dev) dev = py_dev->data(); else { std::vector devs; PYOPENCL_GET_VEC_INFO(Context, ctx.data(), CL_CONTEXT_DEVICES, devs); if (devs.size() == 0) throw pyopencl::error("CommandQueue", CL_INVALID_VALUE, "context doesn't have any devices? -- don't know which one to default to"); dev = devs[0]; } int hex_plat_version = ctx.get_hex_platform_version(); bool props_given_as_numeric; cl_command_queue_properties num_props; if (py_props.is_none()) { num_props = 0; props_given_as_numeric = true; } else { try { num_props = py::cast(py_props); props_given_as_numeric = true; } catch (py::cast_error &) { props_given_as_numeric = false; } } if (props_given_as_numeric) { #if PYOPENCL_CL_VERSION >= 0x2000 if (hex_plat_version >= 0x2000) { cl_queue_properties props_list[] = { CL_QUEUE_PROPERTIES, num_props, 0 }; cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateCommandQueueWithProperties"); m_queue = clCreateCommandQueueWithProperties( ctx.data(), dev, props_list, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("CommandQueue", status_code); } else #endif { cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateCommandQueue"); #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wdeprecated-declarations" #endif m_queue = clCreateCommandQueue( ctx.data(), dev, num_props, &status_code); #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic pop #endif if (status_code != CL_SUCCESS) throw pyopencl::error("CommandQueue", status_code); } } else { #if PYOPENCL_CL_VERSION < 0x2000 throw error("CommandQueue", CL_INVALID_VALUE, "queue properties given as an iterable, " "which is only allowed when PyOpenCL was built " "against an OpenCL 2+ header"); #else if (hex_plat_version < 0x2000) { std::cerr << "queue properties given as an iterable, " "which uses an OpenCL 2+-only interface, " "but the context's platform does not " "declare OpenCL 2 support. Proceeding " "as requested, but the next thing you see " "may be a crash." << std:: endl; } PYOPENCL_STACK_CONTAINER(cl_queue_properties, props, py::len(py_props) + 1); { size_t i = 0; for (auto prop: py_props) props[i++] = py::cast(prop); props[i++] = 0; } cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateCommandQueueWithProperties"); m_queue = clCreateCommandQueueWithProperties( ctx.data(), dev, PYOPENCL_STACK_CONTAINER_GET_PTR(props), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("CommandQueue", status_code); #endif } } ~command_queue() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseCommandQueue, (m_queue)); } const cl_command_queue data() const { if (m_finalized) { auto mod_warnings(py::module_::import_("warnings")); auto mod_cl(py::module_::import_("pyopencl")); mod_warnings.attr("warn")( "Command queue used after exit of context manager. 
" "This is deprecated and will stop working in 2023.", mod_cl.attr("CommandQueueUsedAfterExit") ); } return m_queue; } void finalize() { m_finalized = true; } PYOPENCL_EQUALITY_TESTS(command_queue); py::object get_info(cl_command_queue_info param_name) const { switch (param_name) { case CL_QUEUE_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(CommandQueue, m_queue, param_name, cl_context, context); case CL_QUEUE_DEVICE: PYOPENCL_GET_OPAQUE_INFO(CommandQueue, m_queue, param_name, cl_device_id, device); case CL_QUEUE_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(CommandQueue, m_queue, param_name, cl_uint); case CL_QUEUE_PROPERTIES: PYOPENCL_GET_TYPED_INFO(CommandQueue, m_queue, param_name, cl_command_queue_properties); #if PYOPENCL_CL_VERSION >= 0x2000 case CL_QUEUE_SIZE: PYOPENCL_GET_TYPED_INFO(CommandQueue, m_queue, param_name, cl_uint); #endif #if PYOPENCL_CL_VERSION >= 0x2010 case CL_QUEUE_DEVICE_DEFAULT: PYOPENCL_GET_OPAQUE_INFO( CommandQueue, m_queue, param_name, cl_command_queue, command_queue); #endif #if PYOPENCL_CL_VERSION >= 0x3000 case CL_QUEUE_PROPERTIES_ARRAY: { std::vector result; PYOPENCL_GET_VEC_INFO(CommandQueue, data(), param_name, result); PYOPENCL_RETURN_VECTOR(cl_queue_properties, result); } #endif default: throw error("CommandQueue.get_info", CL_INVALID_VALUE); } } py::ref get_context() const { cl_context param_value; PYOPENCL_CALL_GUARDED(clGetCommandQueueInfo, (data(), CL_QUEUE_CONTEXT, sizeof(param_value), ¶m_value, 0)); return py::ref(new context(param_value, /*retain*/ true)); } #if PYOPENCL_CL_VERSION < 0x1010 cl_command_queue_properties set_property( cl_command_queue_properties prop, bool enable) { cl_command_queue_properties old_prop; PYOPENCL_CALL_GUARDED(clSetCommandQueueProperty, (data(), prop, PYOPENCL_CAST_BOOL(enable), &old_prop)); return old_prop; } #endif void flush() { PYOPENCL_CALL_GUARDED(clFlush, (data())); } void finish() { if (m_finalized) { return; } else { cl_command_queue queue = data(); PYOPENCL_CALL_GUARDED_THREADED(clFinish, (queue)); } } // not exposed to python int get_hex_device_version() const { cl_device_id dev; PYOPENCL_CALL_GUARDED(clGetCommandQueueInfo, (data(), CL_QUEUE_DEVICE, sizeof(dev), &dev, nullptr)); std::string dev_version; { size_t param_value_size; PYOPENCL_CALL_GUARDED(clGetDeviceInfo, (dev, CL_DEVICE_VERSION, 0, 0, ¶m_value_size)); std::vector param_value(param_value_size); PYOPENCL_CALL_GUARDED(clGetDeviceInfo, (dev, CL_DEVICE_VERSION, param_value_size, param_value.empty( ) ? nullptr : ¶m_value.front(), ¶m_value_size)); dev_version = param_value.empty( ) ? "" : std::string(¶m_value.front(), param_value_size-1); } int major_ver, minor_ver; errno = 0; int match_count = sscanf(dev_version.c_str(), "OpenCL %d.%d ", &major_ver, &minor_ver); if (errno || match_count != 2) throw error("CommandQueue._get_hex_device_version", CL_INVALID_VALUE, "Platform version string did not have expected format"); return major_ver << 12 | minor_ver << 4; } }; // }}} // {{{ command_queue_ref // In contrast to command_queue, command_queue_ref is "nullable", i.e. // it is a RAII *optional* reference to a command queue. class command_queue_ref { private: bool m_valid; cl_command_queue m_queue; public: command_queue_ref() : m_valid(false) {} command_queue_ref(cl_command_queue queue) : m_valid(queue != nullptr), m_queue(queue) { // E.g. SVM allocations of size zero use a NULL queue. Tolerate that. 
if (m_valid) PYOPENCL_CALL_GUARDED(clRetainCommandQueue, (m_queue)); } command_queue_ref(command_queue_ref &&src) noexcept : m_valid(src.m_valid), m_queue(src.m_queue) { src.m_valid = false; } command_queue_ref(const command_queue_ref &src) : m_valid(src.m_valid), m_queue(src.m_queue) { // Note that there isn't anything per se wrong with this // copy constructor, the refcounting is just potentially // expensive. // // All code in current use moves these, it does not copy them, // so this should never get called. // // Unfortunately, we can't delete this copy constructor, // because we would like to return these from functions. // This makes at least gcc require copy constructors, even // if those are never called due to NRVO. std::cerr << "COPYING A COMMAND_QUEUE_REF." << std::endl; if (m_valid) PYOPENCL_CALL_GUARDED(clRetainCommandQueue, (m_queue)); } command_queue_ref &operator=(const command_queue_ref &) = delete; ~command_queue_ref() { reset(); } bool is_valid() const { return m_valid; } cl_command_queue data() const { if (m_valid) return m_queue; else throw error("command_queue_ref.data", CL_INVALID_VALUE, "command_queue_ref is not valid"); } void reset() { if (m_valid) PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseCommandQueue, (m_queue)); m_valid = false; } void set(cl_command_queue queue) { if (!queue) throw error("command_queue_ref.set", CL_INVALID_VALUE, "cannot set to NULL command queue"); if (m_valid) PYOPENCL_CALL_GUARDED(clReleaseCommandQueue, (m_queue)); m_queue = queue; PYOPENCL_CALL_GUARDED(clRetainCommandQueue, (m_queue)); m_valid = true; } }; // }}} // {{{ event/synchronization class event : noncopyable { private: cl_event m_event; public: event(cl_event event, bool retain) : m_event(event) { if (retain) PYOPENCL_CALL_GUARDED(clRetainEvent, (event)); } event(event const &src) : m_event(src.m_event) { PYOPENCL_CALL_GUARDED(clRetainEvent, (m_event)); } virtual ~event() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseEvent, (m_event)); } const cl_event data() const { return m_event; } PYOPENCL_EQUALITY_TESTS(event); py::object get_info(cl_event_info param_name) const { switch (param_name) { case CL_EVENT_COMMAND_QUEUE: PYOPENCL_GET_OPAQUE_INFO(Event, m_event, param_name, cl_command_queue, command_queue); case CL_EVENT_COMMAND_TYPE: PYOPENCL_GET_TYPED_INFO(Event, m_event, param_name, cl_command_type); case CL_EVENT_COMMAND_EXECUTION_STATUS: PYOPENCL_GET_TYPED_INFO(Event, m_event, param_name, cl_int); case CL_EVENT_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(Event, m_event, param_name, cl_uint); #if PYOPENCL_CL_VERSION >= 0x1010 case CL_EVENT_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(Event, m_event, param_name, cl_context, context); #endif default: throw error("Event.get_info", CL_INVALID_VALUE); } } py::object get_profiling_info(cl_profiling_info param_name) const { switch (param_name) { case CL_PROFILING_COMMAND_QUEUED: case CL_PROFILING_COMMAND_SUBMIT: case CL_PROFILING_COMMAND_START: case CL_PROFILING_COMMAND_END: #if PYOPENCL_CL_VERSION >= 0x2000 case CL_PROFILING_COMMAND_COMPLETE: #endif PYOPENCL_GET_TYPED_INFO(EventProfiling, m_event, param_name, cl_ulong); default: throw error("Event.get_profiling_info", CL_INVALID_VALUE); } } virtual void wait() { PYOPENCL_CALL_GUARDED_THREADED(clWaitForEvents, (1, &m_event)); } // Called from a destructor context below: // - Should not release the GIL // - Should fail gracefully in the face of errors virtual void wait_during_cleanup_without_releasing_the_gil() { PYOPENCL_CALL_GUARDED_CLEANUP(clWaitForEvents, (1, &m_event)); } #if PYOPENCL_CL_VERSION >= 
0x1010
    // {{{ set_callback, by way of a thread-based construction

  private:
    struct event_callback_info_t
    {
      std::mutex m_mutex;
      std::condition_variable m_condvar;

      // FIXME: Should implement GC traversal so that these can be collected.
      py::object m_py_event;
      py::object m_py_callback;

      bool m_set_callback_succeeded;
      bool m_notify_thread_wakeup_is_genuine;

      cl_event m_event;
      cl_int m_command_exec_status;

      event_callback_info_t(py::object py_event, py::object py_callback)
        : m_py_event(py_event), m_py_callback(py_callback),
        m_set_callback_succeeded(true),
        m_notify_thread_wakeup_is_genuine(false)
      {}
    };

    static void CL_CALLBACK evt_callback(cl_event evt, cl_int command_exec_status, void *user_data)
    {
      event_callback_info_t *cb_info = reinterpret_cast<event_callback_info_t *>(user_data);

      {
        std::lock_guard<std::mutex> lg(cb_info->m_mutex);
        cb_info->m_event = evt;
        cb_info->m_command_exec_status = command_exec_status;
        cb_info->m_notify_thread_wakeup_is_genuine = true;
      }

      cb_info->m_condvar.notify_one();
    }

  public:
    void set_callback(cl_int command_exec_callback_type, py::object pfn_event_notify)
    {
      // The reason for doing this via a thread is that a thread is able to
      // block while acquiring the GIL, which we cannot do in the callback
      // itself.
      std::unique_ptr<event_callback_info_t> cb_info_holder(
          new event_callback_info_t(
            handle_from_new_ptr(new event(*this)),
            pfn_event_notify));
      event_callback_info_t *cb_info = cb_info_holder.get();

      std::thread notif_thread([cb_info]()
          {
            {
              std::unique_lock<std::mutex> ulk(cb_info->m_mutex);
              cb_info->m_condvar.wait(
                  ulk,
                  [&](){ return cb_info->m_notify_thread_wakeup_is_genuine; });

              // ulk no longer held here, cb_info ready for deletion
            }

            {
              py::gil_scoped_acquire acquire;

              if (cb_info->m_set_callback_succeeded)
              {
                try {
                  cb_info->m_py_callback(
                      // cb_info->m_py_event,
                      cb_info->m_command_exec_status);
                }
                catch (std::exception &exc)
                {
                  std::cerr
                    << "[pyopencl] event callback handler threw an exception, ignoring: "
                    << exc.what()
                    << std::endl;
                }
              }

              // Need to hold GIL to delete py::object instances in
              // event_callback_info_t
              delete cb_info;
            }
          });
      // Thread is away--it is now its responsibility to free cb_info.
      cb_info_holder.release();

      // Detach, so that the std::thread object is no longer coupled to the
      // lifetime of the thread itself.
      notif_thread.detach();

      try
      {
        PYOPENCL_CALL_GUARDED(clSetEventCallback, (
              data(), command_exec_callback_type, &event::evt_callback, cb_info));
      }
      catch (...) {
        // Setting the callback did not succeed. The thread would never
        // be woken up. Wake it up to let it know that it can stop.
        {
          std::lock_guard<std::mutex> lg(cb_info->m_mutex);
          cb_info->m_set_callback_succeeded = false;
          cb_info->m_notify_thread_wakeup_is_genuine = true;
        }
        cb_info->m_condvar.notify_one();
        throw;
      }
    }

    // }}}
#endif
};

class nanny_event : public event
{
  // In addition to everything an event does, the nanny event holds a reference
  // to a Python object and waits for its own completion upon destruction.
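  //
  // An illustrative sketch (not part of the build) of why the "ward" exists:
  // a non-blocking transfer reads from or writes into a Python buffer, so
  // that buffer must outlive the transfer even if the caller drops it.
  //
  //   std::unique_ptr<py_buffer_wrapper> ward(new py_buffer_wrapper);
  //   ward->get(py_obj.ptr(), PyBUF_ANY_CONTIGUOUS);
  //   // ... enqueue a non-blocking read into ward->m_buf.buf ...
  //   // PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward) then hands the buffer to
  //   // the event; ~nanny_event() waits for completion before letting it go.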
protected: std::unique_ptr m_ward; public: nanny_event(cl_event evt, bool retain, std::unique_ptr &ward) : event(evt, retain), m_ward(std::move(ward)) { } ~nanny_event() { // It appears that Pybind can get very confused if we release the GIL here: // https://github.com/inducer/pyopencl/issues/296 wait_during_cleanup_without_releasing_the_gil(); } py::object get_ward() const { if (m_ward.get()) { return py::borrow(m_ward->m_buf.obj); } else return py::none(); } virtual void wait() { event::wait(); m_ward.reset(); } virtual void wait_during_cleanup_without_releasing_the_gil() { event::wait_during_cleanup_without_releasing_the_gil(); m_ward.reset(); } }; inline void wait_for_events(py::object events) { cl_uint num_events_in_wait_list = 0; std::vector event_wait_list(len(events)); for (py::handle evt: events) event_wait_list[num_events_in_wait_list++] = py::cast(evt).data(); PYOPENCL_CALL_GUARDED_THREADED(clWaitForEvents, ( PYOPENCL_WAITLIST_ARGS)); } #if PYOPENCL_CL_VERSION >= 0x1020 inline event *enqueue_marker_with_wait_list(command_queue &cq, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList, ( cq.data(), PYOPENCL_WAITLIST_ARGS, &evt)); PYOPENCL_RETURN_NEW_EVENT(evt); } inline event *enqueue_barrier_with_wait_list(command_queue &cq, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueBarrierWithWaitList, (cq.data(), PYOPENCL_WAITLIST_ARGS, &evt)); PYOPENCL_RETURN_NEW_EVENT(evt); } #endif // {{{ used internally for pre-OpenCL-1.2 contexts inline event *enqueue_marker(command_queue &cq) { cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarker, ( cq.data(), &evt)); PYOPENCL_RETURN_NEW_EVENT(evt); } inline void enqueue_wait_for_events(command_queue &cq, py::object py_events) { cl_uint num_events = 0; std::vector event_list(len(py_events)); for (py::handle py_evt: py_events) event_list[num_events++] = py::cast(py_evt).data(); PYOPENCL_CALL_GUARDED(clEnqueueWaitForEvents, ( cq.data(), num_events, event_list.empty( ) ? nullptr : &event_list.front())); } inline void enqueue_barrier(command_queue &cq) { PYOPENCL_CALL_GUARDED(clEnqueueBarrier, (cq.data())); } // }}} #if PYOPENCL_CL_VERSION >= 0x1010 class user_event : public event { public: user_event(cl_event evt, bool retain) : event(evt, retain) { } void set_status(cl_int execution_status) { PYOPENCL_CALL_GUARDED(clSetUserEventStatus, (data(), execution_status)); } }; inline void create_user_event(user_event *self, context &ctx) { cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateUserEvent"); cl_event evt = clCreateUserEvent(ctx.data(), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("UserEvent", status_code); try { new (self) user_event(evt, false); } catch (...) 
{ clReleaseEvent(evt); throw; } } #endif // }}} // {{{ memory_object py::object create_mem_object_wrapper(cl_mem mem, bool retain); class memory_object_holder { public: virtual const cl_mem data() const = 0; PYOPENCL_EQUALITY_TESTS(memory_object_holder); size_t size() const { size_t param_value; PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, (data(), CL_MEM_SIZE, sizeof(param_value), ¶m_value, 0)); return param_value; } py::object get_info(cl_mem_info param_name) const; virtual ~memory_object_holder() { } }; class memory_object : noncopyable, public memory_object_holder { public: typedef std::unique_ptr hostbuf_t; private: bool m_valid; cl_mem m_mem; hostbuf_t m_hostbuf; public: memory_object(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t()) : m_valid(true), m_mem(mem) { if (retain) PYOPENCL_CALL_GUARDED(clRetainMemObject, (mem)); m_hostbuf = std::move(hostbuf); } memory_object(memory_object &src) : m_valid(true), m_mem(src.m_mem) { PYOPENCL_CALL_GUARDED(clRetainMemObject, (m_mem)); } memory_object(memory_object &&src) : m_valid(true), m_mem(src.m_mem), m_hostbuf(std::move(src.m_hostbuf)) { } memory_object(memory_object_holder const &src) : m_valid(true), m_mem(src.data()) { PYOPENCL_CALL_GUARDED(clRetainMemObject, (m_mem)); } void release() { if (!m_valid) throw error("MemoryObject.free", CL_INVALID_VALUE, "trying to double-unref mem object"); PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseMemObject, (m_mem)); m_valid = false; } ~memory_object() { if (m_valid) release(); } py::object hostbuf() { if (m_hostbuf.get()) return py::borrow(m_hostbuf->m_buf.obj); else return py::none(); } const cl_mem data() const { return m_mem; } }; #if PYOPENCL_CL_VERSION >= 0x1020 inline event *enqueue_migrate_mem_objects( command_queue &cq, py::object py_mem_objects, cl_mem_migration_flags flags, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; std::vector mem_objects; for (py::handle mo: py_mem_objects) mem_objects.push_back(py::cast(mo).data()); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueMigrateMemObjects, ( cq.data(), mem_objects.size(), mem_objects.empty( ) ? 
nullptr : &mem_objects.front(),
          flags,
          PYOPENCL_WAITLIST_ARGS, &evt
          ));
      );
    PYOPENCL_RETURN_NEW_EVENT(evt);
  }
#endif

  // }}}

  // {{{ buffer

  inline cl_mem create_buffer(
      cl_context ctx,
      cl_mem_flags flags,
      size_t size,
      void *host_ptr)
  {
    cl_int status_code;
    PYOPENCL_PRINT_CALL_TRACE("clCreateBuffer");
    cl_mem mem = clCreateBuffer(ctx, flags, size, host_ptr, &status_code);

    if (status_code != CL_SUCCESS)
      throw pyopencl::error("create_buffer", status_code);

    return mem;
  }

  inline cl_mem create_buffer_gc(
      cl_context ctx,
      cl_mem_flags flags,
      size_t size,
      void *host_ptr)
  {
    PYOPENCL_RETRY_RETURN_IF_MEM_ERROR(
        return create_buffer(ctx, flags, size, host_ptr);
        );
  }

#if PYOPENCL_CL_VERSION >= 0x1010
  inline cl_mem create_sub_buffer(
      cl_mem buffer, cl_mem_flags flags, cl_buffer_create_type bct,
      const void *buffer_create_info)
  {
    cl_int status_code;
    PYOPENCL_PRINT_CALL_TRACE("clCreateSubBuffer");
    cl_mem mem = clCreateSubBuffer(buffer, flags,
        bct, buffer_create_info, &status_code);

    if (status_code != CL_SUCCESS)
      throw pyopencl::error("clCreateSubBuffer", status_code);

    return mem;
  }

  inline cl_mem create_sub_buffer_gc(
      cl_mem buffer, cl_mem_flags flags, cl_buffer_create_type bct,
      const void *buffer_create_info)
  {
    PYOPENCL_RETRY_RETURN_IF_MEM_ERROR(
        return create_sub_buffer(buffer, flags, bct, buffer_create_info);
        );
  }
#endif

  class buffer : public memory_object
  {
    public:
      buffer(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t())
        : memory_object(mem, retain, std::move(hostbuf))
      { }

#if PYOPENCL_CL_VERSION >= 0x1010
      buffer *get_sub_region(
          size_t origin, size_t size, cl_mem_flags flags) const
      {
        cl_buffer_region region = { origin, size };

        cl_mem mem = create_sub_buffer_gc(
            data(), flags, CL_BUFFER_CREATE_TYPE_REGION, &region);

        try
        {
          return new buffer(mem, false);
        }
        catch (...)
        {
          PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem));
          throw;
        }
      }

      buffer *getitem(py::object slc) const
      {
        PYOPENCL_BUFFER_SIZE_T start, end, stride, length;

        if (!PySlice_Check(slc.ptr()))
          throw pyopencl::error("Buffer.__getitem__", CL_INVALID_VALUE,
              "Buffer slice must be a slice object");

        size_t my_length;
        PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
            (data(), CL_MEM_SIZE, sizeof(my_length), &my_length, 0));

        if (PySlice_GetIndicesEx(slc.ptr(),
              my_length, &start, &end, &stride, &length) != 0)
          throw py::python_error();

        if (stride != 1)
          throw pyopencl::error("Buffer.__getitem__", CL_INVALID_VALUE,
              "Buffer slice must have stride 1");

        cl_mem_flags my_flags;
        PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
            (data(), CL_MEM_FLAGS, sizeof(my_flags), &my_flags, 0));

        my_flags &= ~CL_MEM_COPY_HOST_PTR;

        if (end <= start)
          throw pyopencl::error("Buffer.__getitem__", CL_INVALID_VALUE,
              "Buffer slice must have end > start");

        return get_sub_region(start, end-start, my_flags);
      }
#endif
  };

  // {{{ buffer creation

  inline void create_buffer_py(
      buffer *self,
      context &ctx,
      cl_mem_flags flags,
      size_t size,
      py::object py_hostbuf
      )
  {
    if (py_hostbuf.ptr() != Py_None &&
        !(flags & (CL_MEM_USE_HOST_PTR | CL_MEM_COPY_HOST_PTR)))
      PyErr_Warn(PyExc_UserWarning, "'hostbuf' was passed, "
          "but no memory flags to make use of it.");

    void *buf = 0;

    std::unique_ptr<py_buffer_wrapper> retained_buf_obj;
    if (py_hostbuf.ptr() != Py_None)
    {
      retained_buf_obj = std::unique_ptr<py_buffer_wrapper>(new py_buffer_wrapper);

      int py_buf_flags = PyBUF_ANY_CONTIGUOUS;
      if ((flags & CL_MEM_USE_HOST_PTR)
          && ((flags & CL_MEM_READ_WRITE)
            || (flags & CL_MEM_WRITE_ONLY)))
        py_buf_flags |= PyBUF_WRITABLE;

      retained_buf_obj->get(py_hostbuf.ptr(), py_buf_flags);

      buf = retained_buf_obj->m_buf.buf;

      if (size > size_t(retained_buf_obj->m_buf.len))
        throw pyopencl::error("Buffer", CL_INVALID_VALUE,
            "specified size is greater than host buffer size");
      if (size == 0)
        size = retained_buf_obj->m_buf.len;
    }

    cl_mem mem = create_buffer_gc(ctx.data(), flags, size, buf);

    if (!(flags & CL_MEM_USE_HOST_PTR))
      retained_buf_obj.reset();

    try
    {
      new (self) buffer(mem, false, std::move(retained_buf_obj));
    }
    catch (...)
    {
      PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem));
      throw;
    }
  }

  // }}}

  // {{{ buffer transfers

  // {{{ byte-for-byte transfers

  inline event *enqueue_read_buffer(
      command_queue &cq,
      memory_object_holder &mem,
      py::object buffer,
      size_t src_offset,
      py::object py_wait_for,
      bool is_blocking)
  {
    PYOPENCL_PARSE_WAIT_FOR;

    void *buf;
    PYOPENCL_BUFFER_SIZE_T len;

    std::unique_ptr<py_buffer_wrapper> ward(new py_buffer_wrapper);
    ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS | PyBUF_WRITABLE);

    buf = ward->m_buf.buf;
    len = ward->m_buf.len;

    cl_command_queue queue = cq.data();

    cl_event evt;
    PYOPENCL_RETRY_IF_MEM_ERROR(
      PYOPENCL_CALL_GUARDED_THREADED(clEnqueueReadBuffer, (
            queue,
            mem.data(),
            PYOPENCL_CAST_BOOL(is_blocking),
            src_offset, len, buf,
            PYOPENCL_WAITLIST_ARGS, &evt
            ))
      );
    PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward);
  }

  inline event *enqueue_write_buffer(
      command_queue &cq,
      memory_object_holder &mem,
      py::object buffer,
      size_t dst_offset,
      py::object py_wait_for,
      bool is_blocking)
  {
    PYOPENCL_PARSE_WAIT_FOR;

    const void *buf;
    PYOPENCL_BUFFER_SIZE_T len;

    std::unique_ptr<py_buffer_wrapper> ward(new py_buffer_wrapper);
    ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS);

    buf = ward->m_buf.buf;
    len = ward->m_buf.len;

    cl_command_queue queue = cq.data();

    cl_event evt;
    PYOPENCL_RETRY_IF_MEM_ERROR(
      PYOPENCL_CALL_GUARDED_THREADED(clEnqueueWriteBuffer, (
            queue,
            mem.data(),
            PYOPENCL_CAST_BOOL(is_blocking),
            dst_offset, len, buf,
            PYOPENCL_WAITLIST_ARGS, &evt
            ))
      );
    PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward);
  }

  inline event *enqueue_copy_buffer(
      command_queue &cq,
      memory_object_holder &src,
      memory_object_holder &dst,
      ptrdiff_t byte_count,
      size_t src_offset,
      size_t dst_offset,
      py::object py_wait_for)
  {
    PYOPENCL_PARSE_WAIT_FOR;

    if (byte_count < 0)
    {
      // Default to copying the smaller of the two buffer sizes,
      // querying source and destination separately.
      size_t byte_count_src = 0;
      size_t byte_count_dst = 0;
      PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
          (src.data(), CL_MEM_SIZE, sizeof(byte_count), &byte_count_src, 0));
      PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
          (dst.data(), CL_MEM_SIZE, sizeof(byte_count), &byte_count_dst, 0));
      byte_count = std::min(byte_count_src, byte_count_dst);
    }

    cl_event evt;
    PYOPENCL_RETRY_IF_MEM_ERROR(
      PYOPENCL_CALL_GUARDED(clEnqueueCopyBuffer, (
            cq.data(),
            src.data(), dst.data(),
            src_offset, dst_offset,
            byte_count,
            PYOPENCL_WAITLIST_ARGS,
            &evt
            ))
      );

    PYOPENCL_RETURN_NEW_EVENT(evt);
  }

#ifdef CL_DEVICE_P2P_DEVICES_AMD
  inline event *enqueue_copy_buffer_p2p_amd(
      platform &plat,
      command_queue &cq,
      memory_object_holder &src,
      memory_object_holder &dst,
      py::object py_byte_count,
      py::object py_wait_for)
  {
    PYOPENCL_PARSE_WAIT_FOR;

    ptrdiff_t byte_count = 0;
    if (py_byte_count.ptr() == Py_None)
    {
      size_t byte_count_src = 0;
      size_t byte_count_dst = 0;
      PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
          (src.data(), CL_MEM_SIZE, sizeof(byte_count), &byte_count_src, 0));
      PYOPENCL_CALL_GUARDED(clGetMemObjectInfo,
          (dst.data(), CL_MEM_SIZE, sizeof(byte_count), &byte_count_dst, 0));
      byte_count = std::min(byte_count_src, byte_count_dst);
    }
    else
    {
      byte_count = py::cast<ptrdiff_t>(py_byte_count);
    }

    clEnqueueCopyBufferP2PAMD_fn fn =
      (clEnqueueCopyBufferP2PAMD_fn)clGetExtensionFunctionAddressForPlatform(
          plat.data(), "clEnqueueCopyBufferP2PAMD");
    if (!fn)
      throw pyopencl::error("clGetExtensionFunctionAddressForPlatform",
          CL_INVALID_VALUE,
"clEnqueueCopyBufferP2PAMD is not available"); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(fn, ( cq.data(), src.data(), dst.data(), 0, 0, byte_count, PYOPENCL_WAITLIST_ARGS, &evt )) ); PYOPENCL_RETURN_NEW_EVENT(evt); } #endif // }}} // {{{ rectangular transfers #if PYOPENCL_CL_VERSION >= 0x1010 inline event *enqueue_read_buffer_rect( command_queue &cq, memory_object_holder &mem, py::object buffer, py::object py_buffer_origin, py::object py_host_origin, py::object py_region, py::object py_buffer_pitches, py::object py_host_pitches, py::object py_wait_for, bool is_blocking ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(buffer_origin); COPY_PY_COORD_TRIPLE(host_origin); COPY_PY_REGION_TRIPLE(region); COPY_PY_PITCH_TUPLE(buffer_pitches); COPY_PY_PITCH_TUPLE(host_pitches); void *buf; std::unique_ptr ward(new py_buffer_wrapper); ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS | PyBUF_WRITABLE); buf = ward->m_buf.buf; cl_command_queue queue = cq.data(); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED_THREADED(clEnqueueReadBufferRect, ( queue, mem.data(), PYOPENCL_CAST_BOOL(is_blocking), buffer_origin, host_origin, region, buffer_pitches[0], buffer_pitches[1], host_pitches[0], host_pitches[1], buf, PYOPENCL_WAITLIST_ARGS, &evt )) ); PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward); } inline event *enqueue_write_buffer_rect( command_queue &cq, memory_object_holder &mem, py::object buffer, py::object py_buffer_origin, py::object py_host_origin, py::object py_region, py::object py_buffer_pitches, py::object py_host_pitches, py::object py_wait_for, bool is_blocking ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(buffer_origin); COPY_PY_COORD_TRIPLE(host_origin); COPY_PY_REGION_TRIPLE(region); COPY_PY_PITCH_TUPLE(buffer_pitches); COPY_PY_PITCH_TUPLE(host_pitches); const void *buf; std::unique_ptr ward(new py_buffer_wrapper); ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS); buf = ward->m_buf.buf; cl_command_queue queue = cq.data(); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED_THREADED(clEnqueueWriteBufferRect, ( queue, mem.data(), PYOPENCL_CAST_BOOL(is_blocking), buffer_origin, host_origin, region, buffer_pitches[0], buffer_pitches[1], host_pitches[0], host_pitches[1], buf, PYOPENCL_WAITLIST_ARGS, &evt )) ); PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward); } inline event *enqueue_copy_buffer_rect( command_queue &cq, memory_object_holder &src, memory_object_holder &dst, py::object py_src_origin, py::object py_dst_origin, py::object py_region, py::object py_src_pitches, py::object py_dst_pitches, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(src_origin); COPY_PY_COORD_TRIPLE(dst_origin); COPY_PY_REGION_TRIPLE(region); COPY_PY_PITCH_TUPLE(src_pitches); COPY_PY_PITCH_TUPLE(dst_pitches); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueCopyBufferRect, ( cq.data(), src.data(), dst.data(), src_origin, dst_origin, region, src_pitches[0], src_pitches[1], dst_pitches[0], dst_pitches[1], PYOPENCL_WAITLIST_ARGS, &evt )) ); PYOPENCL_RETURN_NEW_EVENT(evt); } #endif // }}} // }}} #if PYOPENCL_CL_VERSION >= 0x1020 inline event *enqueue_fill_buffer( command_queue &cq, memory_object_holder &mem, py::object pattern, size_t offset, size_t size, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; const void *pattern_buf; PYOPENCL_BUFFER_SIZE_T pattern_len; std::unique_ptr ward(new py_buffer_wrapper); ward->get(pattern.ptr(), PyBUF_ANY_CONTIGUOUS); pattern_buf = ward->m_buf.buf; pattern_len = ward->m_buf.len; cl_event evt; 
PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueFillBuffer, ( cq.data(), mem.data(), pattern_buf, pattern_len, offset, size, PYOPENCL_WAITLIST_ARGS, &evt )) ); PYOPENCL_RETURN_NEW_EVENT(evt); } #endif // }}} // {{{ image class image : public memory_object { public: image(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t()) : memory_object(mem, retain, std::move(hostbuf)) { } py::object get_image_info(cl_image_info param_name) const { switch (param_name) { case CL_IMAGE_FORMAT: PYOPENCL_GET_TYPED_INFO(Image, data(), param_name, cl_image_format); case CL_IMAGE_ELEMENT_SIZE: case CL_IMAGE_ROW_PITCH: case CL_IMAGE_SLICE_PITCH: case CL_IMAGE_WIDTH: case CL_IMAGE_HEIGHT: case CL_IMAGE_DEPTH: #if PYOPENCL_CL_VERSION >= 0x1020 case CL_IMAGE_ARRAY_SIZE: #endif PYOPENCL_GET_TYPED_INFO(Image, data(), param_name, size_t); #if PYOPENCL_CL_VERSION >= 0x1020 case CL_IMAGE_BUFFER: { cl_mem param_value; PYOPENCL_CALL_GUARDED(clGetImageInfo, \ (data(), param_name, sizeof(param_value), ¶m_value, 0)); if (param_value == 0) { // no associated memory object? no problem. return py::none(); } return create_mem_object_wrapper(param_value, /* retain */ true); } case CL_IMAGE_NUM_MIP_LEVELS: case CL_IMAGE_NUM_SAMPLES: PYOPENCL_GET_TYPED_INFO(Image, data(), param_name, cl_uint); #endif default: throw error("Image.get_image_info", CL_INVALID_VALUE); } } }; // {{{ image formats inline void set_image_format(cl_image_format *self, cl_channel_order ord, cl_channel_type tp) { self->image_channel_order = ord; self->image_channel_data_type = tp; } inline py::list get_supported_image_formats( context const &ctx, cl_mem_flags flags, cl_mem_object_type image_type) { cl_uint num_image_formats; PYOPENCL_CALL_GUARDED(clGetSupportedImageFormats, ( ctx.data(), flags, image_type, 0, nullptr, &num_image_formats)); std::vector formats(num_image_formats); PYOPENCL_CALL_GUARDED(clGetSupportedImageFormats, ( ctx.data(), flags, image_type, formats.size(), formats.empty( ) ? 
nullptr : &formats.front(), nullptr)); PYOPENCL_RETURN_VECTOR(cl_image_format, formats); } inline cl_uint get_image_format_channel_count(cl_image_format const &fmt) { switch (fmt.image_channel_order) { case CL_R: return 1; case CL_A: return 1; case CL_RG: return 2; case CL_RA: return 2; case CL_RGB: return 3; case CL_RGBA: return 4; case CL_BGRA: return 4; case CL_INTENSITY: return 1; case CL_LUMINANCE: return 1; default: throw pyopencl::error("ImageFormat.channel_dtype_size", CL_INVALID_VALUE, "unrecognized channel order"); } } inline cl_uint get_image_format_channel_dtype_size(cl_image_format const &fmt) { switch (fmt.image_channel_data_type) { case CL_SNORM_INT8: return 1; case CL_SNORM_INT16: return 2; case CL_UNORM_INT8: return 1; case CL_UNORM_INT16: return 2; case CL_UNORM_SHORT_565: return 2; case CL_UNORM_SHORT_555: return 2; case CL_UNORM_INT_101010: return 4; case CL_SIGNED_INT8: return 1; case CL_SIGNED_INT16: return 2; case CL_SIGNED_INT32: return 4; case CL_UNSIGNED_INT8: return 1; case CL_UNSIGNED_INT16: return 2; case CL_UNSIGNED_INT32: return 4; case CL_HALF_FLOAT: return 2; case CL_FLOAT: return 4; default: throw pyopencl::error("ImageFormat.channel_dtype_size", CL_INVALID_VALUE, "unrecognized channel data type"); } } inline cl_uint get_image_format_item_size(cl_image_format const &fmt) { return get_image_format_channel_count(fmt) * get_image_format_channel_dtype_size(fmt); } // }}} // {{{ image creation inline void create_image( image *self, context const &ctx, cl_mem_flags flags, cl_image_format const &fmt, py::sequence shape, py::sequence pitches, py::object buffer) { if (shape.ptr() == Py_None) throw pyopencl::error("Image", CL_INVALID_VALUE, "'shape' must be given"); void *buf = 0; PYOPENCL_BUFFER_SIZE_T len = 0; std::unique_ptr retained_buf_obj; if (buffer.ptr() != Py_None) { retained_buf_obj = std::unique_ptr(new py_buffer_wrapper); int py_buf_flags = PyBUF_ANY_CONTIGUOUS; if ((flags & CL_MEM_USE_HOST_PTR) && ((flags & CL_MEM_READ_WRITE) || (flags & CL_MEM_WRITE_ONLY))) py_buf_flags |= PyBUF_WRITABLE; retained_buf_obj->get(buffer.ptr(), py_buf_flags); buf = retained_buf_obj->m_buf.buf; len = retained_buf_obj->m_buf.len; } unsigned dims = py::len(shape); cl_int status_code; cl_mem mem; if (dims == 2) { size_t width = py::cast(shape[0]); size_t height = py::cast(shape[1]); size_t pitch = 0; if (pitches.ptr() != Py_None) { if (py::len(pitches) != 1) throw pyopencl::error("Image", CL_INVALID_VALUE, "invalid length of pitch tuple"); pitch = py::cast(pitches[0]); } // check buffer size cl_int itemsize = get_image_format_item_size(fmt); if (buf && std::max(pitch, width*itemsize)*height > cl_uint(len)) throw pyopencl::error("Image", CL_INVALID_VALUE, "buffer too small"); PYOPENCL_PRINT_CALL_TRACE("clCreateImage2D"); PYOPENCL_RETRY_IF_MEM_ERROR( { mem = clCreateImage2D(ctx.data(), flags, &fmt, width, height, pitch, buf, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateImage2D", status_code); } ); } else if (dims == 3) { size_t width = py::cast(shape[0]); size_t height = py::cast(shape[1]); size_t depth = py::cast(shape[2]); size_t pitch_x = 0; size_t pitch_y = 0; if (pitches.ptr() != Py_None) { if (py::len(pitches) != 2) throw pyopencl::error("Image", CL_INVALID_VALUE, "invalid length of pitch tuple"); pitch_x = py::cast(pitches[0]); pitch_y = py::cast(pitches[1]); } // check buffer size cl_int itemsize = get_image_format_item_size(fmt); if (buf && std::max(std::max(pitch_x, width*itemsize)*height, pitch_y) * depth > cl_uint(len)) throw 
pyopencl::error("Image", CL_INVALID_VALUE, "buffer too small"); PYOPENCL_PRINT_CALL_TRACE("clCreateImage3D"); PYOPENCL_RETRY_IF_MEM_ERROR( { mem = clCreateImage3D(ctx.data(), flags, &fmt, width, height, depth, pitch_x, pitch_y, buf, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateImage3D", status_code); } ); } else throw pyopencl::error("Image", CL_INVALID_VALUE, "invalid dimension"); if (!(flags & CL_MEM_USE_HOST_PTR)) retained_buf_obj.reset(); try { new (self) image(mem, false, std::move(retained_buf_obj)); } catch (...) { PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem)); throw; } } #if PYOPENCL_CL_VERSION >= 0x1020 inline void create_image_from_desc( image *self, context const &ctx, cl_mem_flags flags, cl_image_format const &fmt, cl_image_desc &desc, py::object buffer) { if (buffer.ptr() != Py_None && !(flags & (CL_MEM_USE_HOST_PTR | CL_MEM_COPY_HOST_PTR))) PyErr_Warn(PyExc_UserWarning, "'hostbuf' was passed, " "but no memory flags to make use of it."); void *buf = 0; std::unique_ptr retained_buf_obj; if (buffer.ptr() != Py_None) { retained_buf_obj = std::unique_ptr(new py_buffer_wrapper); int py_buf_flags = PyBUF_ANY_CONTIGUOUS; if ((flags & CL_MEM_USE_HOST_PTR) && ((flags & CL_MEM_READ_WRITE) || (flags & CL_MEM_WRITE_ONLY))) py_buf_flags |= PyBUF_WRITABLE; retained_buf_obj->get(buffer.ptr(), py_buf_flags); buf = retained_buf_obj->m_buf.buf; } PYOPENCL_PRINT_CALL_TRACE("clCreateImage"); cl_int status_code; cl_mem mem = clCreateImage(ctx.data(), flags, &fmt, &desc, buf, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateImage", status_code); if (!(flags & CL_MEM_USE_HOST_PTR)) retained_buf_obj.reset(); try { new (self) image(mem, false, std::move(retained_buf_obj)); } catch (...) { PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem)); throw; } } #endif // }}} // {{{ image transfers inline event *enqueue_read_image( command_queue &cq, image &img, py::object py_origin, py::object py_region, py::object buffer, size_t row_pitch, size_t slice_pitch, py::object py_wait_for, bool is_blocking) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); void *buf; std::unique_ptr ward(new py_buffer_wrapper); ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS | PyBUF_WRITABLE); buf = ward->m_buf.buf; cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueReadImage, ( cq.data(), img.data(), PYOPENCL_CAST_BOOL(is_blocking), origin, region, row_pitch, slice_pitch, buf, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward); } inline event *enqueue_write_image( command_queue &cq, image &img, py::object py_origin, py::object py_region, py::object buffer, size_t row_pitch, size_t slice_pitch, py::object py_wait_for, bool is_blocking) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); const void *buf; std::unique_ptr ward(new py_buffer_wrapper); ward->get(buffer.ptr(), PyBUF_ANY_CONTIGUOUS); buf = ward->m_buf.buf; cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueWriteImage, ( cq.data(), img.data(), PYOPENCL_CAST_BOOL(is_blocking), origin, region, row_pitch, slice_pitch, buf, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_NANNY_EVENT(evt, ward); } inline event *enqueue_copy_image( command_queue &cq, memory_object_holder &src, memory_object_holder &dest, py::object py_src_origin, py::object py_dest_origin, py::object py_region, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(src_origin); 
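  // Illustrative note (assuming the usual definitions of these macros
  // elsewhere in this header): the COPY_PY_*_TRIPLE macros expand Python
  // tuples of length <= 3 into the size_t[3] arrays that the CL image calls
  // expect, with origins padded by 0 and regions padded by 1, so that a 2D
  // call behaves sensibly, e.g.:
  //
  //   (0, 0)     -> {0, 0, 0}     (origin)
  //   (512, 512) -> {512, 512, 1} (region)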
COPY_PY_COORD_TRIPLE(dest_origin); COPY_PY_REGION_TRIPLE(region); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueCopyImage, ( cq.data(), src.data(), dest.data(), src_origin, dest_origin, region, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_EVENT(evt); } inline event *enqueue_copy_image_to_buffer( command_queue &cq, memory_object_holder &src, memory_object_holder &dest, py::object py_origin, py::object py_region, size_t offset, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueCopyImageToBuffer, ( cq.data(), src.data(), dest.data(), origin, region, offset, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_EVENT(evt); } inline event *enqueue_copy_buffer_to_image( command_queue &cq, memory_object_holder &src, memory_object_holder &dest, size_t offset, py::object py_origin, py::object py_region, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueCopyBufferToImage, ( cq.data(), src.data(), dest.data(), offset, origin, region, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_EVENT(evt); } // }}} #if PYOPENCL_CL_VERSION >= 0x1020 inline event *enqueue_fill_image( command_queue &cq, memory_object_holder &mem, py::object color, py::object py_origin, py::object py_region, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); const void *color_buf; std::unique_ptr ward(new py_buffer_wrapper); ward->get(color.ptr(), PyBUF_ANY_CONTIGUOUS); color_buf = ward->m_buf.buf; cl_event evt; PYOPENCL_RETRY_IF_MEM_ERROR( PYOPENCL_CALL_GUARDED(clEnqueueFillImage, ( cq.data(), mem.data(), color_buf, origin, region, PYOPENCL_WAITLIST_ARGS, &evt )); ); PYOPENCL_RETURN_NEW_EVENT(evt); } #endif // }}} // {{{ pipe class pipe : public memory_object { public: pipe(cl_mem mem, bool retain) : memory_object(mem, retain) { } #if PYOPENCL_CL_VERSION < 0x2000 typedef void* cl_pipe_info; #endif py::object get_pipe_info(cl_pipe_info param_name) const { #if PYOPENCL_CL_VERSION >= 0x2000 switch (param_name) { case CL_PIPE_PACKET_SIZE: case CL_PIPE_MAX_PACKETS: PYOPENCL_GET_TYPED_INFO(Pipe, data(), param_name, cl_uint); default: throw error("Pipe.get_pipe_info", CL_INVALID_VALUE); } #else throw error("Pipes not available. PyOpenCL was not compiled against a CL2+ header.", CL_INVALID_VALUE); #endif } }; #if PYOPENCL_CL_VERSION >= 0x2000 inline void create_pipe( pipe *self, context const &ctx, cl_mem_flags flags, cl_uint pipe_packet_size, cl_uint pipe_max_packets, py::sequence py_props) { #if 0 PYOPENCL_STACK_CONTAINER(cl_pipe_properties, props, py::len(py_props) + 1); { size_t i = 0; for (auto prop: py_props) props[i++] = py::cast(prop); props[i++] = 0; } #endif if (py::len(py_props) != 0) throw pyopencl::error("Pipe", CL_INVALID_VALUE, "non-empty properties " "argument to Pipe not allowed"); cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreatePipe"); cl_mem mem = clCreatePipe( ctx.data(), flags, pipe_packet_size, pipe_max_packets, nullptr, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("Pipe", status_code); try { new (self) pipe(mem, false); } catch (...) 
{ PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem)); throw; } } #endif // }}} // {{{ maps class memory_map { private: bool m_valid; py::ref m_queue; memory_object m_mem; void *m_ptr; public: memory_map(py::ref cq, memory_object const &mem, void *ptr) : m_valid(true), m_queue(cq), m_mem(mem), m_ptr(ptr) { } ~memory_map() { if (m_valid) delete release(0, py::none()); } event *release(command_queue *cq, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; if (cq == 0) cq = m_queue.get(); cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueUnmapMemObject, ( cq->data(), m_mem.data(), m_ptr, PYOPENCL_WAITLIST_ARGS, &evt )); m_valid = false; PYOPENCL_RETURN_NEW_EVENT(evt); } }; // FIXME: Reenable in pypy #ifndef PYPY_VERSION inline py::object enqueue_map_buffer( py::ref cq, memory_object_holder &buf, cl_map_flags flags, size_t offset, py::object py_shape, py::object dtype, py::object py_order, py::object py_strides, py::object py_wait_for, bool is_blocking ) { PYOPENCL_PARSE_WAIT_FOR; PYOPENCL_PARSE_NUMPY_ARRAY_SPEC; npy_uintp size_in_bytes = PyDataType_ELSIZE(tp_descr); for (npy_intp sdim: shape) size_in_bytes *= sdim; py::object result; PyArrayObject *result_arr; cl_event evt; cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clEnqueueMapBuffer"); void *mapped; PYOPENCL_RETRY_IF_MEM_ERROR( { { py::gil_scoped_release release; mapped = clEnqueueMapBuffer( cq->data(), buf.data(), PYOPENCL_CAST_BOOL(is_blocking), flags, offset, size_in_bytes, PYOPENCL_WAITLIST_ARGS, &evt, &status_code); } if (status_code != CL_SUCCESS) throw pyopencl::error("clEnqueueMapBuffer", status_code); } ); event evt_handle(evt, false); std::unique_ptr map; try { result = py::object(py::steal(PyArray_NewFromDescr( &PyArray_Type, tp_descr, shape.size(), shape.empty() ? nullptr : &shape.front(), strides.empty() ? nullptr : &strides.front(), mapped, ary_flags, /*obj*/nullptr))); result_arr = (PyArrayObject *) result.ptr(); if (size_in_bytes != (npy_uintp) PyArray_NBYTES(result_arr)) throw pyopencl::error("enqueue_map_buffer", CL_INVALID_VALUE, "miscalculated numpy array size (not contiguous?)"); map = std::unique_ptr(new memory_map(cq, buf, mapped)); } catch (...) { PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueUnmapMemObject, ( cq->data(), buf.data(), mapped, 0, 0, 0)); throw; } py::object map_py(handle_from_new_ptr(map.release())); PyArray_SetBaseObject(result_arr, map_py.ptr()); Py_INCREF(map_py.ptr()); return py::make_tuple( result, handle_from_new_ptr(new event(evt_handle))); } #endif // FIXME: Reenable in pypy #ifndef PYPY_VERSION inline py::object enqueue_map_image( py::ref cq, memory_object_holder &img, cl_map_flags flags, py::object py_origin, py::object py_region, py::object py_shape, py::object dtype, py::object py_order, py::object py_strides, py::object py_wait_for, bool is_blocking ) { PYOPENCL_PARSE_WAIT_FOR; PYOPENCL_PARSE_NUMPY_ARRAY_SPEC; COPY_PY_COORD_TRIPLE(origin); COPY_PY_REGION_TRIPLE(region); cl_event evt; cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clEnqueueMapImage"); size_t row_pitch, slice_pitch; void *mapped; PYOPENCL_RETRY_IF_MEM_ERROR( { { py::gil_scoped_release release; mapped = clEnqueueMapImage( cq->data(), img.data(), PYOPENCL_CAST_BOOL(is_blocking), flags, origin, region, &row_pitch, &slice_pitch, PYOPENCL_WAITLIST_ARGS, &evt, &status_code); } if (status_code != CL_SUCCESS) throw pyopencl::error("clEnqueueMapImage", status_code); } ); event evt_handle(evt, false); std::unique_ptr map; try { map = std::unique_ptr(new memory_map(cq, img, mapped)); } catch (...) 
{ PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueUnmapMemObject, ( cq->data(), img.data(), mapped, 0, 0, 0)); throw; } py::object result = py::steal(PyArray_NewFromDescr( &PyArray_Type, tp_descr, shape.size(), shape.empty() ? nullptr : &shape.front(), strides.empty() ? nullptr : &strides.front(), mapped, ary_flags, /*obj*/nullptr)); PyArrayObject *result_arr = (PyArrayObject *) result.ptr(); py::object map_py(handle_from_new_ptr(map.release())); PyArray_SetBaseObject(result_arr, map_py.ptr()); Py_INCREF(map_py.ptr()); return py::make_tuple( result, handle_from_new_ptr(new event(evt_handle)), row_pitch, slice_pitch); } #endif // }}} #if PYOPENCL_CL_VERSION >= 0x2000 // {{{ svm pointer class size_not_available { }; class svm_pointer { public: virtual void *svm_ptr() const = 0; // may throw size_not_available virtual size_t size() const = 0; virtual ~svm_pointer() { } }; // }}} // {{{ svm_arg_wrapper class svm_arg_wrapper : public svm_pointer { private: void *m_ptr; PYOPENCL_BUFFER_SIZE_T m_size; std::unique_ptr ward; py::object m_mem; public: svm_arg_wrapper(py::object holder) : m_mem(holder) { ward = std::unique_ptr(new py_buffer_wrapper); #ifdef PYPY_VERSION // FIXME: get a read-only buffer // Not quite honest, but Pypy doesn't consider numpy arrays // created from objects with the __array_interface__ writeable. ward->get(holder.ptr(), PyBUF_ANY_CONTIGUOUS); #else ward->get(holder.ptr(), PyBUF_ANY_CONTIGUOUS | PyBUF_WRITABLE); #endif m_ptr = ward->m_buf.buf; m_size = ward->m_buf.len; } void *svm_ptr() const { return m_ptr; } size_t size() const { return m_size; } py::object mem() const { return m_mem; } }; // }}} // {{{ svm_allocation class svm_allocation : public svm_pointer { private: py::ref m_context; void *m_allocation; size_t m_size; command_queue_ref m_queue; // FIXME Should maybe also allow keeping a list of events so that we can // wait for users to finish in the case of out-of-order queues. 
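  //
  // An illustrative sketch (not part of the build) of the intended lifecycle:
  // the optional in-order queue recorded here lets release()/enqueue_release()
  // order the clEnqueueSVMFree against work still pending on that queue.
  // Hypothetical use, assuming a context ref ctx_ref and a command_queue q:
  //
  //   svm_allocation alloc(ctx_ref, nbytes, alignment, CL_MEM_READ_WRITE, &q);
  //   // ... enqueue kernels on q that use alloc.svm_ptr() ...
  //   alloc.release();  // enqueues the free on q, ordered after those kernels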
public: svm_allocation(py::ref const &ctx, size_t size, cl_uint alignment, cl_svm_mem_flags flags, const command_queue *queue = nullptr) : m_context(ctx), m_size(size) { if (queue) { m_queue.set(queue->data()); if (is_queue_out_of_order(m_queue.data())) throw error("SVMAllocation.__init__", CL_INVALID_VALUE, "supplying an out-of-order queue to SVMAllocation is invalid"); } if (size) { int try_count = 0; while (try_count < 2) { PYOPENCL_PRINT_CALL_TRACE("clSVMalloc"); m_allocation = clSVMAlloc( ctx->data(), flags, size, alignment); if (m_allocation) return; ++try_count; run_python_gc(); } if (!m_allocation) throw pyopencl::error("clSVMAlloc", CL_OUT_OF_RESOURCES); } } svm_allocation(py::ref const &ctx, void *allocation, size_t size, const cl_command_queue queue) : m_context(ctx), m_allocation(allocation), m_size(size) { if (queue) { if (is_queue_out_of_order(queue)) { release(); throw error("SVMAllocation.__init__", CL_INVALID_VALUE, "supplying an out-of-order queue to SVMAllocation is invalid"); } m_queue.set(queue); } } svm_allocation(const svm_allocation &) = delete; svm_allocation &operator=(const svm_allocation &) = delete; ~svm_allocation() { if (m_allocation) release(); } void release() { if (m_size == 0) return; if (!m_allocation) throw error("SVMAllocation.release", CL_INVALID_VALUE, "trying to double-unref svm allocation"); if (m_queue.is_valid()) { PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueSVMFree, ( m_queue.data(), 1, &m_allocation, nullptr, nullptr, 0, nullptr, nullptr)); m_queue.reset(); } else { PYOPENCL_PRINT_CALL_TRACE("clSVMFree"); clSVMFree(m_context->data(), m_allocation); } m_allocation = nullptr; } event *enqueue_release(command_queue *queue, py::object py_wait_for) { PYOPENCL_PARSE_WAIT_FOR; if (m_size && !m_allocation) throw error("SVMAllocation.enqueue_release", CL_INVALID_VALUE, "trying to enqueue_release on an already-freed allocation"); cl_command_queue use_queue; if (queue) use_queue = queue->data(); else { if (m_queue.is_valid()) use_queue = m_queue.data(); else throw error("SVMAllocation.enqueue_release", CL_INVALID_VALUE, "no implicit queue available, must be provided explicitly"); } cl_event evt; if (m_size == 0) { // We need to get an event from somewhere... // We're using SVM, we must have 2.0 > 1.2. 
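        // An SVM allocation implies an OpenCL 2.0+ platform, and 2.0 includes
        // the 1.2 APIs, so clEnqueueMarkerWithWaitList is safe to call here.
        // The marker supplies a completion event for a "free" that is
        // otherwise a no-op (nothing was allocated for size zero).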
PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueMarkerWithWaitList, (use_queue, PYOPENCL_WAITLIST_ARGS, &evt)); } else { PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueSVMFree, ( use_queue, 1, &m_allocation, nullptr, nullptr, PYOPENCL_WAITLIST_ARGS, &evt)); } m_allocation = nullptr; PYOPENCL_RETURN_NEW_EVENT(evt); } void *svm_ptr() const { return m_allocation; } size_t size() const { return m_size; } bool operator==(svm_allocation const &other) const { return m_allocation == other.m_allocation; } bool operator!=(svm_allocation const &other) const { return m_allocation != other.m_allocation; } void bind_to_queue(command_queue const &queue) { if (is_queue_out_of_order(queue.data())) throw error("SVMAllocation.bind_to_queue", CL_INVALID_VALUE, "supplying an out-of-order queue to SVMAllocation is invalid"); if (m_queue.is_valid()) { if (m_queue.data() != queue.data()) { // make sure synchronization promises stay valid in new queue cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarker, (m_queue.data(), &evt)); PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList, (queue.data(), 1, &evt, nullptr)); } } m_queue.set(queue.data()); } void unbind_from_queue() { if (m_queue.is_valid()) PYOPENCL_CALL_GUARDED_THREADED(clFinish, (m_queue.data())); m_queue.reset(); } // only use for testing/diagnostic/debugging purposes! cl_command_queue queue() const { if (m_queue.is_valid()) return m_queue.data(); else return nullptr; } }; // }}} // {{{ svm operations inline event *enqueue_svm_memcpy( command_queue &cq, cl_bool is_blocking, svm_pointer &dst, svm_pointer &src, py::object py_wait_for, py::object byte_count_py ) { PYOPENCL_PARSE_WAIT_FOR; // {{{ process size PYOPENCL_GET_SVM_SIZE(src); PYOPENCL_GET_SVM_SIZE(dst); size_t size = 0; bool have_size = false; if (src_has_size) { size = src_size; have_size = true; } if (dst_has_size) { if (have_size) { if (!byte_count_py.is_none()) size = std::min(size, dst_size); else if (size != dst_size) throw error("_enqueue_svm_memcpy", CL_INVALID_VALUE, "sizes of source and destination buffer do not match"); } else { size = dst_size; have_size = true; } } if (!byte_count_py.is_none()) { size_t byte_count = py::cast(byte_count_py); if (have_size && byte_count > size) throw error("_enqueue_svm_memcpy", CL_INVALID_VALUE, "specified byte_count larger than size of source or destination buffers"); size = byte_count; have_size = true; } if (!have_size) throw error("_enqueue_svm_memcpy", CL_INVALID_VALUE, "size not passed and could not be determined"); // }}} cl_event evt; PYOPENCL_CALL_GUARDED( clEnqueueSVMMemcpy, ( cq.data(), is_blocking, dst.svm_ptr(), src.svm_ptr(), size, PYOPENCL_WAITLIST_ARGS, &evt )); PYOPENCL_RETURN_NEW_EVENT(evt); } inline event *enqueue_svm_memfill( command_queue &cq, svm_pointer &dst, py::object py_pattern, py::object byte_count, py::object py_wait_for ) { PYOPENCL_PARSE_WAIT_FOR; const void *pattern_ptr; PYOPENCL_BUFFER_SIZE_T pattern_len; std::unique_ptr pattern_ward(new py_buffer_wrapper); pattern_ward->get(py_pattern.ptr(), PyBUF_ANY_CONTIGUOUS); pattern_ptr = pattern_ward->m_buf.buf; pattern_len = pattern_ward->m_buf.len; // {{{ process size PYOPENCL_GET_SVM_SIZE(dst); size_t size = 0; bool have_size = false; if (dst_has_size) { size = dst_size; have_size = true; } if (!byte_count.is_none()) { size_t user_size = py::cast(byte_count); if (have_size && user_size > size) throw error("enqueue_svm_memfill", CL_INVALID_VALUE, "byte_count too large for specified SVM buffer"); } if (!have_size) { throw error("enqueue_svm_memfill", CL_INVALID_VALUE, "byte_count not passed and 
could not be determined"); }

    // }}}

    cl_event evt;
    PYOPENCL_CALL_GUARDED(
        clEnqueueSVMMemFill,
        (
         cq.data(),
         dst.svm_ptr(), pattern_ptr, pattern_len, size,
         PYOPENCL_WAITLIST_ARGS, &evt
        ));

    PYOPENCL_RETURN_NEW_EVENT(evt);
  }

  inline event *enqueue_svm_map(
      command_queue &cq,
      cl_bool is_blocking,
      cl_map_flags flags,
      svm_pointer &svm,
      py::object py_wait_for, py::object user_size_py
      )
  {
    PYOPENCL_PARSE_WAIT_FOR;

    // {{{ process size

    PYOPENCL_GET_SVM_SIZE(svm);

    size_t size = 0;
    bool have_size = false;

    if (svm_has_size)
    {
      size = svm_size;
      have_size = true;
    }

    if (!user_size_py.is_none())
    {
      size_t user_size = py::cast<size_t>(user_size_py);
      if (have_size && user_size > size)
        throw error("enqueue_svm_map", CL_INVALID_VALUE,
            "user-provided size too large for specified SVM buffer");

      // Honor an explicitly provided size, mirroring enqueue_svm_memcpy.
      size = user_size;
      have_size = true;
    }

    if (!have_size)
    {
      throw error("enqueue_svm_map", CL_INVALID_VALUE,
          "size not passed and could not be determined");
    }

    // }}}

    cl_event evt;
    PYOPENCL_CALL_GUARDED(
        clEnqueueSVMMap,
        (
         cq.data(),
         is_blocking, flags,
         svm.svm_ptr(), size,
         PYOPENCL_WAITLIST_ARGS, &evt
        ));

    PYOPENCL_RETURN_NEW_EVENT(evt);
  }

  inline event *enqueue_svm_unmap(
      command_queue &cq,
      svm_pointer &svm,
      py::object py_wait_for
      )
  {
    PYOPENCL_PARSE_WAIT_FOR;

    cl_event evt;
    PYOPENCL_CALL_GUARDED(
        clEnqueueSVMUnmap,
        (
         cq.data(),
         svm.svm_ptr(),
         PYOPENCL_WAITLIST_ARGS, &evt
        ));

    PYOPENCL_RETURN_NEW_EVENT(evt);
  }
#endif

#if PYOPENCL_CL_VERSION >= 0x2010
  inline event *enqueue_svm_migratemem(
      command_queue &cq,
      py::sequence svms,
      cl_mem_migration_flags flags,
      py::object py_wait_for
      )
  {
    PYOPENCL_PARSE_WAIT_FOR;

    std::vector<const void *> svm_pointers;
    std::vector<size_t> sizes;

    for (py::handle py_svm: svms)
    {
      svm_pointer &svm(py::cast<svm_pointer &>(py_svm));

      svm_pointers.push_back(svm.svm_ptr());
      sizes.push_back(svm.size());
    }

    cl_event evt;
    PYOPENCL_CALL_GUARDED(
        clEnqueueSVMMigrateMem,
        (
         cq.data(),
         svm_pointers.size(),
         svm_pointers.empty() ? nullptr : &svm_pointers.front(),
         sizes.empty() ? nullptr : &sizes.front(),
         flags,
         PYOPENCL_WAITLIST_ARGS, &evt
        ));

    PYOPENCL_RETURN_NEW_EVENT(evt);
  }
#endif

  // }}}

  // {{{ sampler

  class sampler : noncopyable
  {
    private:
      cl_sampler m_sampler;

    public:
#if PYOPENCL_CL_VERSION >= 0x2000
      sampler(context const &ctx, py::sequence py_props)
      {
        int hex_plat_version = ctx.get_hex_platform_version();
        if (hex_plat_version < 0x2000)
        {
          std::cerr <<
            "sampler properties given as an iterable, "
            "which uses an OpenCL 2+-only interface, "
            "but the context's platform does not "
            "declare OpenCL 2 support. Proceeding "
            "as requested, but the next thing you see "
            "may be a crash."
<< std:: endl; } PYOPENCL_STACK_CONTAINER(cl_sampler_properties, props, py::len(py_props) + 1); { size_t i = 0; for (auto prop: py_props) props[i++] = py::cast(prop); props[i++] = 0; } cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateSamplerWithProperties"); m_sampler = clCreateSamplerWithProperties( ctx.data(), PYOPENCL_STACK_CONTAINER_GET_PTR(props), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("Sampler", status_code); } #endif sampler(context const &ctx, bool normalized_coordinates, cl_addressing_mode am, cl_filter_mode fm) { PYOPENCL_PRINT_CALL_TRACE("clCreateSampler"); int hex_plat_version = ctx.get_hex_platform_version(); #if PYOPENCL_CL_VERSION >= 0x2000 if (hex_plat_version >= 0x2000) { cl_sampler_properties props_list[] = { CL_SAMPLER_NORMALIZED_COORDS, normalized_coordinates, CL_SAMPLER_ADDRESSING_MODE, am, CL_SAMPLER_FILTER_MODE, fm, 0, }; cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateSamplerWithProperties"); m_sampler = clCreateSamplerWithProperties( ctx.data(), props_list, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("Sampler", status_code); } else #endif { cl_int status_code; #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wdeprecated-declarations" #endif m_sampler = clCreateSampler( ctx.data(), normalized_coordinates, am, fm, &status_code); #if defined(__GNUG__) && !defined(__clang__) #pragma GCC diagnostic pop #endif if (status_code != CL_SUCCESS) throw pyopencl::error("Sampler", status_code); } } sampler(cl_sampler samp, bool retain) : m_sampler(samp) { if (retain) PYOPENCL_CALL_GUARDED(clRetainSampler, (samp)); } ~sampler() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseSampler, (m_sampler)); } cl_sampler data() const { return m_sampler; } PYOPENCL_EQUALITY_TESTS(sampler); py::object get_info(cl_sampler_info param_name) const { switch (param_name) { case CL_SAMPLER_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, cl_uint); case CL_SAMPLER_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(Sampler, m_sampler, param_name, cl_context, context); case CL_SAMPLER_ADDRESSING_MODE: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, cl_addressing_mode); case CL_SAMPLER_FILTER_MODE: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, cl_filter_mode); case CL_SAMPLER_NORMALIZED_COORDS: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, cl_bool); #if PYOPENCL_CL_VERSION >= 0x3000 case CL_SAMPLER_PROPERTIES: { std::vector result; PYOPENCL_GET_VEC_INFO(Sampler, m_sampler, param_name, result); PYOPENCL_RETURN_VECTOR(cl_sampler_properties, result); } #endif #ifdef CL_SAMPLER_MIP_FILTER_MODE_KHR case CL_SAMPLER_MIP_FILTER_MODE_KHR: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, cl_filter_mode); case CL_SAMPLER_LOD_MIN_KHR: case CL_SAMPLER_LOD_MAX_KHR: PYOPENCL_GET_TYPED_INFO(Sampler, m_sampler, param_name, float); #endif default: throw error("Sampler.get_info", CL_INVALID_VALUE); } } }; // }}} // {{{ program class program : noncopyable { public: enum program_kind_type { KND_UNKNOWN, KND_SOURCE, KND_BINARY, KND_IL }; private: cl_program m_program; program_kind_type m_program_kind; public: program(cl_program prog, bool retain, program_kind_type progkind=KND_UNKNOWN) : m_program(prog), m_program_kind(progkind) { if (retain) PYOPENCL_CALL_GUARDED(clRetainProgram, (prog)); } ~program() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseProgram, (m_program)); } cl_program data() const { return m_program; } program_kind_type kind() const { return m_program_kind; } 
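    // Illustrative note (not part of the build): m_program_kind records how
    // the program was created -- create_program_with_source below tags
    // KND_SOURCE and create_program_with_binary tags KND_BINARY -- so callers
    // can distinguish, say, a binary-cache hit from a fresh source build
    // without extra clGetProgramInfo round trips.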
PYOPENCL_EQUALITY_TESTS(program); py::object get_info(cl_program_info param_name) const { switch (param_name) { case CL_PROGRAM_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(Program, m_program, param_name, cl_uint); case CL_PROGRAM_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(Program, m_program, param_name, cl_context, context); case CL_PROGRAM_NUM_DEVICES: PYOPENCL_GET_TYPED_INFO(Program, m_program, param_name, cl_uint); case CL_PROGRAM_DEVICES: { std::vector<cl_device_id> result; PYOPENCL_GET_VEC_INFO(Program, m_program, param_name, result); py::list py_result; for (cl_device_id did: result) py_result.append(handle_from_new_ptr( new pyopencl::device(did))); return py_result; } case CL_PROGRAM_SOURCE: PYOPENCL_GET_STR_INFO(Program, m_program, param_name); case CL_PROGRAM_BINARY_SIZES: { std::vector<size_t> result; PYOPENCL_GET_VEC_INFO(Program, m_program, param_name, result); PYOPENCL_RETURN_VECTOR(size_t, result); } case CL_PROGRAM_BINARIES: // {{{ { std::vector<size_t> sizes; PYOPENCL_GET_VEC_INFO(Program, m_program, CL_PROGRAM_BINARY_SIZES, sizes); size_t total_size = std::accumulate(sizes.begin(), sizes.end(), 0); std::unique_ptr<unsigned char []> result( new unsigned char[total_size]); std::vector<unsigned char *> result_ptrs; unsigned char *ptr = result.get(); for (unsigned i = 0; i < sizes.size(); ++i) { result_ptrs.push_back(ptr); ptr += sizes[i]; } PYOPENCL_CALL_GUARDED(clGetProgramInfo, (m_program, param_name, sizes.size()*sizeof(unsigned char *), result_ptrs.empty( ) ? nullptr : &result_ptrs.front(), 0)); py::list py_result; ptr = result.get(); for (unsigned i = 0; i < sizes.size(); ++i) { py::object binary_pyobj( py::steal( #if PY_VERSION_HEX >= 0x03000000 PyBytes_FromStringAndSize( reinterpret_cast<char *>(ptr), sizes[i]) #else PyString_FromStringAndSize( reinterpret_cast<char *>(ptr), sizes[i]) #endif )); py_result.append(binary_pyobj); ptr += sizes[i]; } return py_result; } // }}} #if PYOPENCL_CL_VERSION >= 0x1020 case CL_PROGRAM_NUM_KERNELS: PYOPENCL_GET_TYPED_INFO(Program, m_program, param_name, size_t); case CL_PROGRAM_KERNEL_NAMES: PYOPENCL_GET_STR_INFO(Program, m_program, param_name); #endif #if PYOPENCL_CL_VERSION >= 0x2010 case CL_PROGRAM_IL: PYOPENCL_GET_STR_INFO(Program, m_program, param_name); #endif #if PYOPENCL_CL_VERSION >= 0x2020 case CL_PROGRAM_SCOPE_GLOBAL_CTORS_PRESENT: case CL_PROGRAM_SCOPE_GLOBAL_DTORS_PRESENT: PYOPENCL_GET_TYPED_INFO(Program, m_program, param_name, cl_bool); #endif default: throw error("Program.get_info", CL_INVALID_VALUE); } } py::object get_build_info( device const &dev, cl_program_build_info param_name) const { switch (param_name) { #define PYOPENCL_FIRST_ARG m_program, dev.data() // hackety hack case CL_PROGRAM_BUILD_STATUS: PYOPENCL_GET_TYPED_INFO(ProgramBuild, PYOPENCL_FIRST_ARG, param_name, cl_build_status); case CL_PROGRAM_BUILD_OPTIONS: case CL_PROGRAM_BUILD_LOG: PYOPENCL_GET_STR_INFO(ProgramBuild, PYOPENCL_FIRST_ARG, param_name); #if PYOPENCL_CL_VERSION >= 0x1020 case CL_PROGRAM_BINARY_TYPE: PYOPENCL_GET_TYPED_INFO(ProgramBuild, PYOPENCL_FIRST_ARG, param_name, cl_program_binary_type); #endif #if PYOPENCL_CL_VERSION >= 0x2000 case CL_PROGRAM_BUILD_GLOBAL_VARIABLE_TOTAL_SIZE: PYOPENCL_GET_TYPED_INFO(ProgramBuild, PYOPENCL_FIRST_ARG, param_name, size_t); #endif #undef PYOPENCL_FIRST_ARG default: throw error("Program.get_build_info", CL_INVALID_VALUE); } } void build(py::bytes options, py::object py_devices) { PYOPENCL_PARSE_PY_DEVICES; PYOPENCL_CALL_GUARDED_THREADED(clBuildProgram, (m_program, num_devices, devices, options.c_str(), 0, 0)); } #if PYOPENCL_CL_VERSION >= 0x1020 void compile(py::bytes options, py::object
py_devices, py::object py_headers) { PYOPENCL_PARSE_PY_DEVICES; // {{{ pick apart py_headers // py_headers is a list of tuples *(name, program)* std::vector header_names; std::vector programs; for (py::handle name_hdr_tup_py: py_headers) { py::tuple name_hdr_tup = py::borrow(name_hdr_tup_py); if (py::len(name_hdr_tup) != 2) throw error("Program.compile", CL_INVALID_VALUE, "expected (name, header) tuple in headers list"); std::string name = py::cast(name_hdr_tup[0]); program &prg = py::cast(name_hdr_tup[1]); header_names.push_back(name); programs.push_back(prg.data()); } std::vector header_name_ptrs; for (std::string const &name: header_names) header_name_ptrs.push_back(name.c_str()); // }}} PYOPENCL_CALL_GUARDED_THREADED(clCompileProgram, (m_program, num_devices, devices, options.c_str(), header_names.size(), programs.empty() ? nullptr : &programs.front(), header_name_ptrs.empty() ? nullptr : &header_name_ptrs.front(), 0, 0)); } #endif #if PYOPENCL_CL_VERSION >= 0x2020 void set_specialization_constant(cl_uint spec_id, py::object py_buffer) { py_buffer_wrapper bufwrap; bufwrap.get(py_buffer.ptr(), PyBUF_ANY_CONTIGUOUS); PYOPENCL_CALL_GUARDED(clSetProgramSpecializationConstant, (m_program, spec_id, bufwrap.m_buf.len, bufwrap.m_buf.buf)); } #endif }; inline void create_program_with_source( program *self, context &ctx, std::string const &src) { const char *string = src.c_str(); size_t length = src.size(); cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateProgramWithSource"); cl_program result = clCreateProgramWithSource( ctx.data(), 1, &string, &length, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateProgramWithSource", status_code); try { new (self) program(result, false, program::KND_SOURCE); } catch (...) { clReleaseProgram(result); throw; } } inline void create_program_with_binary( program *self, context &ctx, py::sequence py_devices, py::sequence py_binaries) { std::vector devices; std::vector binaries; std::vector sizes; size_t num_devices = len(py_devices); if (len(py_binaries) != num_devices) throw error("create_program_with_binary", CL_INVALID_VALUE, "device and binary counts don't match"); for (size_t i = 0; i < num_devices; ++i) { devices.push_back(py::cast(py_devices[i]).data()); const void *buf; PYOPENCL_BUFFER_SIZE_T len; py_buffer_wrapper buf_wrapper; buf_wrapper.get(py::object(py_binaries[i]).ptr(), PyBUF_ANY_CONTIGUOUS); buf = buf_wrapper.m_buf.buf; len = buf_wrapper.m_buf.len; binaries.push_back(reinterpret_cast(buf)); sizes.push_back(len); } PYOPENCL_STACK_CONTAINER(cl_int, binary_statuses, num_devices); cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateProgramWithBinary"); cl_program result = clCreateProgramWithBinary( ctx.data(), num_devices, devices.empty( ) ? nullptr : &devices.front(), sizes.empty( ) ? nullptr : &sizes.front(), binaries.empty( ) ? nullptr : &binaries.front(), PYOPENCL_STACK_CONTAINER_GET_PTR(binary_statuses), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateProgramWithBinary", status_code); /* for (int i = 0; i < num_devices; ++i) printf("%d:%d\n", i, binary_statuses[i]); */ try { new (self) program(result, false, program::KND_BINARY); } catch (...) 
{ clReleaseProgram(result); throw; } } #if (PYOPENCL_CL_VERSION >= 0x1020) || \ ((PYOPENCL_CL_VERSION >= 0x1030) && defined(__APPLE__)) inline program *create_program_with_built_in_kernels( context &ctx, py::object py_devices, std::string const &kernel_names) { PYOPENCL_PARSE_PY_DEVICES; cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateProgramWithBuiltInKernels"); cl_program result = clCreateProgramWithBuiltInKernels( ctx.data(), num_devices, devices, kernel_names.c_str(), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateProgramWithBuiltInKernels", status_code); try { return new program(result, false); } catch (...) { clReleaseProgram(result); throw; } } #endif #if (PYOPENCL_CL_VERSION >= 0x2010) inline program *create_program_with_il( context &ctx, py::bytes const &src) { cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateProgramWithIL"); cl_program result = clCreateProgramWithIL( ctx.data(), src.c_str(), src.size(), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateProgramWithIL", status_code); try { return new program(result, false, program::KND_IL); } catch (...) { clReleaseProgram(result); throw; } } #endif #if PYOPENCL_CL_VERSION >= 0x1020 inline program *link_program( context &ctx, py::object py_programs, py::bytes options, py::object py_devices ) { PYOPENCL_PARSE_PY_DEVICES; std::vector programs; for (py::handle py_prg: py_programs) { program &prg = py::cast(py_prg); programs.push_back(prg.data()); } cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clLinkProgram"); cl_program result = clLinkProgram( ctx.data(), num_devices, devices, options.c_str(), programs.size(), programs.empty() ? nullptr : &programs.front(), 0, 0, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clLinkProgram", result, status_code); try { return new program(result, false); } catch (...) { clReleaseProgram(result); throw; } } #endif #if PYOPENCL_CL_VERSION >= 0x1020 inline void unload_platform_compiler(platform &plat) { PYOPENCL_CALL_GUARDED(clUnloadPlatformCompiler, (plat.data())); } #endif // }}} // {{{ kernel class local_memory { private: size_t m_size; public: local_memory(size_t size) : m_size(size) { } size_t size() const { return m_size; } }; class kernel : noncopyable { private: cl_kernel m_kernel; bool m_set_arg_prefer_svm; // Source is a Python object so that we can hold a reference to the source object // without a need to copy it. // // Not implementing GC traversals for this because (IMO) it's // unlikely the source string is involved in a cycle with the // kernel object. py::object m_source; // These are generated code, unlikely to hold a reference back to the // kernel, therefore also not implementing GC traversal for this. 
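// These callables are produced per kernel by
// pyopencl.invoker.generate_enqueue_and_set_args (see set_up_basic_invokers
// below) and cached on the kernel so that repeated launches do not redo the
// Python-level code generation.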
py::object m_enqueue_func; py::object m_set_args_func; public: kernel(cl_kernel knl, bool retain) : m_kernel(knl), m_set_arg_prefer_svm(false) { if (retain) PYOPENCL_CALL_GUARDED(clRetainKernel, (knl)); set_up_basic_invokers(); } kernel(py::object prg_py, std::string const &kernel_name) : m_set_arg_prefer_svm(false) { program const *prg = nullptr; try { prg = py::cast(prg_py); } catch (py::cast_error) { prg = py::cast(prg_py.attr("_get_prg")()); } cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCreateKernel"); m_kernel = clCreateKernel(prg->data(), kernel_name.c_str(), &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCreateKernel", status_code); m_source = py::getattr(prg_py, "_source", py::object()); set_up_basic_invokers(); } ~kernel() { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseKernel, (m_kernel)); } cl_kernel data() const { return m_kernel; } py::object source() const { return m_source; } PYOPENCL_EQUALITY_TESTS(kernel); #if PYOPENCL_CL_VERSION >= 0x2010 kernel *clone() { cl_int status_code; PYOPENCL_PRINT_CALL_TRACE("clCloneKernel"); cl_kernel result = clCloneKernel(m_kernel, &status_code); if (status_code != CL_SUCCESS) throw pyopencl::error("clCloneKernel", status_code); try { return new kernel(result, /* retain */ false); } catch (...) { PYOPENCL_CALL_GUARDED_CLEANUP(clReleaseKernel, (result)); throw; } } #endif void set_arg_null(cl_uint arg_index) { cl_mem m = 0; PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, sizeof(cl_mem), &m)); } void set_arg_mem(cl_uint arg_index, memory_object_holder &moh) { cl_mem m = moh.data(); PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, sizeof(cl_mem), &m)); } void set_arg_local(cl_uint arg_index, local_memory const &loc) { PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, loc.size(), 0)); } void set_arg_sampler(cl_uint arg_index, sampler const &smp) { cl_sampler s = smp.data(); PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, sizeof(cl_sampler), &s)); } void set_arg_command_queue(cl_uint arg_index, command_queue const &queue) { cl_command_queue q = queue.data(); PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, sizeof(cl_command_queue), &q)); } void set_arg_buf_pack(cl_uint arg_index, py::handle py_typechar, py::handle obj) { py::bytes typechar_str(py::cast(py_typechar)); if (typechar_str.size() != 1) throw error("Kernel.set_arg_buf_pack", CL_INVALID_VALUE, "type char argument must have exactly one character"); char typechar = *typechar_str.c_str(); #define PYOPENCL_KERNEL_PACK_AND_SET_ARG(TYPECH_VAL, TYPE, CAST_TYPE) \ case TYPECH_VAL: \ { \ TYPE val = (TYPE) py::cast(obj); \ PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, sizeof(val), &val)); \ break; \ } switch (typechar) { // FIXME: nanobind thinks of char as "short string", not number // The detour via 'int' may lose data. 
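// The type characters follow the convention of Python's struct module
// ('b'/'B' signed/unsigned char, 'i' int, 'f' float, and so on). A hedged
// sketch of the Python-side call this supports (argument values are
// hypothetical; the flattened tuple holds (index, typechar, value) triples,
// matching the three-argument set_arg_multi below):
//
//   knl._set_arg_buf_pack_multi((0, b"i", 42, 1, b"f", 3.0))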
PYOPENCL_KERNEL_PACK_AND_SET_ARG('c', char, int) PYOPENCL_KERNEL_PACK_AND_SET_ARG('b', signed char, int) PYOPENCL_KERNEL_PACK_AND_SET_ARG('B', unsigned char, int) PYOPENCL_KERNEL_PACK_AND_SET_ARG('h', short, short) PYOPENCL_KERNEL_PACK_AND_SET_ARG('H', unsigned short, unsigned short) PYOPENCL_KERNEL_PACK_AND_SET_ARG('i', int, int) PYOPENCL_KERNEL_PACK_AND_SET_ARG('I', unsigned int, unsigned int) PYOPENCL_KERNEL_PACK_AND_SET_ARG('l', long, long) PYOPENCL_KERNEL_PACK_AND_SET_ARG('L', unsigned long, unsigned long) PYOPENCL_KERNEL_PACK_AND_SET_ARG('f', float, float) PYOPENCL_KERNEL_PACK_AND_SET_ARG('d', double, double) default: throw error("Kernel.set_arg_buf_pack", CL_INVALID_VALUE, "invalid type char"); } #undef PYOPENCL_KERNEL_PACK_AND_SET_ARG } void set_arg_buf(cl_uint arg_index, py::handle py_buffer) { const void *buf; PYOPENCL_BUFFER_SIZE_T len; py_buffer_wrapper buf_wrapper; try { buf_wrapper.get(py_buffer.ptr(), PyBUF_ANY_CONTIGUOUS); } catch (py::python_error &) { PyErr_Clear(); throw error("Kernel.set_arg", CL_INVALID_VALUE, "invalid kernel argument"); } buf = buf_wrapper.m_buf.buf; len = buf_wrapper.m_buf.len; PYOPENCL_CALL_GUARDED(clSetKernelArg, (m_kernel, arg_index, len, buf)); } #if PYOPENCL_CL_VERSION >= 0x2000 void set_arg_svm(cl_uint arg_index, svm_pointer const &wrp) { PYOPENCL_CALL_GUARDED(clSetKernelArgSVMPointer, (m_kernel, arg_index, wrp.svm_ptr())); } #endif void set_arg(cl_uint arg_index, py::handle arg) { if (arg.ptr() == Py_None) { set_arg_null(arg_index); return; } // It turns out that a taken 'catch' has a relatively high cost, so // in deciding which of "mem object" and "svm" to try first, we use // whatever we were given last time around. if (m_set_arg_prefer_svm) { #if PYOPENCL_CL_VERSION >= 0x2000 try { set_arg_svm(arg_index, py::cast(arg)); return; } catch (py::cast_error &) { } #endif try { set_arg_mem(arg_index, py::cast(arg)); m_set_arg_prefer_svm = false; return; } catch (py::cast_error &) { } } else { try { set_arg_mem(arg_index, py::cast(arg)); return; } catch (py::cast_error &) { } #if PYOPENCL_CL_VERSION >= 0x2000 try { set_arg_svm(arg_index, py::cast(arg)); m_set_arg_prefer_svm = true; return; } catch (py::cast_error &) { } #endif } try { set_arg_local(arg_index, py::cast(arg)); return; } catch (py::cast_error &) { } try { set_arg_sampler(arg_index, py::cast(arg)); return; } catch (py::cast_error &) { } try { set_arg_command_queue(arg_index, py::cast(arg)); return; } catch (py::cast_error &) { } set_arg_buf(arg_index, arg); } py::object get_info(cl_kernel_info param_name) const { switch (param_name) { case CL_KERNEL_FUNCTION_NAME: PYOPENCL_GET_STR_INFO(Kernel, m_kernel, param_name); case CL_KERNEL_NUM_ARGS: case CL_KERNEL_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(Kernel, m_kernel, param_name, cl_uint); case CL_KERNEL_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(Kernel, m_kernel, param_name, cl_context, context); case CL_KERNEL_PROGRAM: PYOPENCL_GET_OPAQUE_INFO(Kernel, m_kernel, param_name, cl_program, program); #if PYOPENCL_CL_VERSION >= 0x1020 case CL_KERNEL_ATTRIBUTES: PYOPENCL_GET_STR_INFO(Kernel, m_kernel, param_name); #endif default: throw error("Kernel.get_info", CL_INVALID_VALUE); } } py::object get_work_group_info( cl_kernel_work_group_info param_name, device const &dev ) const { switch (param_name) { #define PYOPENCL_FIRST_ARG m_kernel, dev.data() // hackety hack case CL_KERNEL_WORK_GROUP_SIZE: PYOPENCL_GET_TYPED_INFO(KernelWorkGroup, PYOPENCL_FIRST_ARG, param_name, size_t); case CL_KERNEL_COMPILE_WORK_GROUP_SIZE: { std::vector result; 
PYOPENCL_GET_VEC_INFO(KernelWorkGroup, PYOPENCL_FIRST_ARG, param_name, result); PYOPENCL_RETURN_VECTOR(size_t, result); } case CL_KERNEL_LOCAL_MEM_SIZE: #if PYOPENCL_CL_VERSION >= 0x1010 case CL_KERNEL_PRIVATE_MEM_SIZE: #endif PYOPENCL_GET_TYPED_INFO(KernelWorkGroup, PYOPENCL_FIRST_ARG, param_name, cl_ulong); #if PYOPENCL_CL_VERSION >= 0x1010 case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: PYOPENCL_GET_TYPED_INFO(KernelWorkGroup, PYOPENCL_FIRST_ARG, param_name, size_t); #endif default: throw error("Kernel.get_work_group_info", CL_INVALID_VALUE); #undef PYOPENCL_FIRST_ARG } } #if PYOPENCL_CL_VERSION >= 0x1020 py::object get_arg_info( cl_uint arg_index, cl_kernel_arg_info param_name ) const { switch (param_name) { #define PYOPENCL_FIRST_ARG m_kernel, arg_index // hackety hack case CL_KERNEL_ARG_ADDRESS_QUALIFIER: PYOPENCL_GET_TYPED_INFO(KernelArg, PYOPENCL_FIRST_ARG, param_name, cl_kernel_arg_address_qualifier); case CL_KERNEL_ARG_ACCESS_QUALIFIER: PYOPENCL_GET_TYPED_INFO(KernelArg, PYOPENCL_FIRST_ARG, param_name, cl_kernel_arg_access_qualifier); case CL_KERNEL_ARG_TYPE_NAME: case CL_KERNEL_ARG_NAME: PYOPENCL_GET_STR_INFO(KernelArg, PYOPENCL_FIRST_ARG, param_name); case CL_KERNEL_ARG_TYPE_QUALIFIER: PYOPENCL_GET_TYPED_INFO(KernelArg, PYOPENCL_FIRST_ARG, param_name, cl_kernel_arg_type_qualifier); #undef PYOPENCL_FIRST_ARG default: throw error("Kernel.get_arg_info", CL_INVALID_VALUE); } } #endif #if PYOPENCL_CL_VERSION >= 0x2010 py::object get_sub_group_info( device const &dev, cl_kernel_sub_group_info param_name, py::object py_input_value) { switch (param_name) { // size_t * -> size_t case CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE: case CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE: { std::vector input_value; COPY_PY_LIST(size_t, input_value); size_t param_value; PYOPENCL_CALL_GUARDED(clGetKernelSubGroupInfo, (m_kernel, dev.data(), param_name, input_value.size()*sizeof(input_value.front()), input_value.empty() ? nullptr : &input_value.front(), sizeof(param_value), ¶m_value, 0)); return py::cast(param_value); } // size_t -> size_t[] case CL_KERNEL_LOCAL_SIZE_FOR_SUB_GROUP_COUNT: { size_t input_value = py::cast(py_input_value); std::vector result; size_t size; PYOPENCL_CALL_GUARDED(clGetKernelSubGroupInfo, (m_kernel, dev.data(), param_name, sizeof(input_value), &input_value, 0, nullptr, &size)); result.resize(size / sizeof(result.front())); PYOPENCL_CALL_GUARDED(clGetKernelSubGroupInfo, (m_kernel, dev.data(), param_name, sizeof(input_value), &input_value, size, result.empty() ? 
nullptr : &result.front(), 0)); PYOPENCL_RETURN_VECTOR(size_t, result); } // () -> size_t case CL_KERNEL_MAX_NUM_SUB_GROUPS: case CL_KERNEL_COMPILE_NUM_SUB_GROUPS: { size_t param_value; PYOPENCL_CALL_GUARDED(clGetKernelSubGroupInfo, (m_kernel, dev.data(), param_name, 0, nullptr, sizeof(param_value), ¶m_value, 0)); return py::cast(param_value); } default: throw error("Kernel.get_sub_group_info", CL_INVALID_VALUE); } } #endif void set_up_basic_invokers() { py::module_ invoker = py::module_::import_("pyopencl.invoker"); py::tuple res = py::cast(invoker.attr("generate_enqueue_and_set_args")( get_info(CL_KERNEL_FUNCTION_NAME), num_args(), num_args(), py::none(), "warn_about_arg_count_bug"_a=py::none(), "work_around_arg_count_bug"_a=py::none(), "devs"_a=get_info(CL_KERNEL_CONTEXT).attr("devices") )); m_enqueue_func = res[0]; m_set_args_func = res[1]; } void set_enqueue_and_set_args(py::object enqueue_func, py::object set_args_func) { m_enqueue_func = enqueue_func; m_set_args_func = set_args_func; } py::object enqueue(py::args args, py::kwargs kwargs) const { return m_enqueue_func(py::cast(this), *args, **kwargs); } void set_args(py::args args, py::kwargs kwargs) const { m_set_args_func(py::cast(this), *args, **kwargs); } cl_uint num_args() const { cl_uint param_value; PYOPENCL_CALL_GUARDED(clGetKernelInfo, (m_kernel, CL_KERNEL_NUM_ARGS, sizeof(param_value), ¶m_value, 0)); return param_value; } }; #define PYOPENCL_KERNEL_SET_ARG_MULTI_ERROR_HANDLER \ catch (error &err) \ { \ std::string msg( \ std::string("when processing arg#") + std::to_string(arg_index+1) \ + std::string(" (1-based): ") + std::string(err.what())); \ \ auto mod_cl_ary(py::module_::import_("pyopencl.array")); \ auto cls_array(mod_cl_ary.attr("Array")); \ int isinstance_result = PyObject_IsInstance(arg_value.ptr(), cls_array.ptr()); \ if (isinstance_result == -1) \ throw py::python_error(); \ \ if (arg_value.ptr() && isinstance_result) \ msg.append( \ " (perhaps you meant to pass 'array.data' instead of the array itself?)"); \ throw error(err.routine().c_str(), err.code(), msg.c_str()); \ } \ catch (std::exception &err) \ { \ std::string msg( \ std::string("when processing arg#") + std::to_string(arg_index+1) \ + std::string(" (1-based): ") + std::string(err.what())); \ throw std::runtime_error(msg.c_str()); \ } inline void set_arg_multi( std::function set_arg_func, py::tuple args_and_indices) { cl_uint arg_index; py::handle arg_value; auto it = args_and_indices.begin(), end = args_and_indices.end(); try { /* This is an internal interface that assumes it gets fed well-formed * data. No meaningful error checking is being performed on * off-interval exhaustion of the iterator, on purpose. */ while (it != end) { // special value in case integer cast fails arg_index = 9999 - 1; arg_index = py::cast(*it++); arg_value = *it++; set_arg_func(arg_index, arg_value); } } PYOPENCL_KERNEL_SET_ARG_MULTI_ERROR_HANDLER } inline void set_arg_multi( std::function set_arg_func, py::tuple args_and_indices) { cl_uint arg_index; py::handle arg_descr, arg_value; auto it = args_and_indices.begin(), end = args_and_indices.end(); try { /* This is an internal interface that assumes it gets fed well-formed * data. No meaningful error checking is being performed on * off-interval exhaustion of the iterator, on purpose. 
*/ while (it != end) { // special value in case integer cast fails arg_index = 9999 - 1; arg_index = py::cast(*it++); arg_descr = *it++; arg_value = *it++; set_arg_func(arg_index, arg_descr, arg_value); } } PYOPENCL_KERNEL_SET_ARG_MULTI_ERROR_HANDLER } inline py::list create_kernels_in_program(program &pgm) { cl_uint num_kernels; PYOPENCL_CALL_GUARDED(clCreateKernelsInProgram, ( pgm.data(), 0, 0, &num_kernels)); std::vector kernels(num_kernels); PYOPENCL_CALL_GUARDED(clCreateKernelsInProgram, ( pgm.data(), num_kernels, kernels.empty( ) ? nullptr : &kernels.front(), &num_kernels)); py::list result; for (cl_kernel knl: kernels) result.append(handle_from_new_ptr(new kernel(knl, true))); return result; } #define MAX_WS_DIM_COUNT 10 inline event *enqueue_nd_range_kernel( command_queue &cq, kernel &knl, py::handle py_global_work_size, py::handle py_local_work_size, py::handle py_global_work_offset, py::handle py_wait_for, bool g_times_l, bool allow_empty_ndrange) { PYOPENCL_PARSE_WAIT_FOR; std::array global_work_size; unsigned gws_size = 0; COPY_PY_ARRAY("enqueue_nd_range_kernel", size_t, global_work_size, gws_size); cl_uint work_dim = gws_size; std::array local_work_size; unsigned lws_size = 0; size_t *local_work_size_ptr = nullptr; if (py_local_work_size.ptr() != Py_None) { COPY_PY_ARRAY("enqueue_nd_range_kernel", size_t, local_work_size, lws_size); if (g_times_l) work_dim = std::max(work_dim, lws_size); else if (work_dim != lws_size) throw error("enqueue_nd_range_kernel", CL_INVALID_VALUE, "global/local work sizes have differing dimensions"); while (lws_size < work_dim) local_work_size[lws_size++] = 1; while (gws_size < work_dim) global_work_size[gws_size++] = 1; local_work_size_ptr = &local_work_size.front(); } if (g_times_l && lws_size) { for (cl_uint work_axis = 0; work_axis < work_dim; ++work_axis) global_work_size[work_axis] *= local_work_size[work_axis]; } size_t *global_work_offset_ptr = nullptr; std::array global_work_offset; if (py_global_work_offset.ptr() != Py_None) { unsigned gwo_size = 0; COPY_PY_ARRAY("enqueue_nd_range_kernel", size_t, global_work_offset, gwo_size); if (work_dim != gwo_size) throw error("enqueue_nd_range_kernel", CL_INVALID_VALUE, "global work size and offset have differing dimensions"); if (g_times_l && local_work_size_ptr) { for (cl_uint work_axis = 0; work_axis < work_dim; ++work_axis) global_work_offset[work_axis] *= local_work_size[work_axis]; } global_work_offset_ptr = &global_work_offset.front(); } if (allow_empty_ndrange) { #if PYOPENCL_CL_VERSION >= 0x1020 bool is_empty = false; for (cl_uint work_axis = 0; work_axis < work_dim; ++work_axis) if (global_work_size[work_axis] == 0) is_empty = true; if (local_work_size_ptr) for (cl_uint work_axis = 0; work_axis < work_dim; ++work_axis) if (local_work_size_ptr[work_axis] == 0) is_empty = true; if (is_empty) { cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList, ( cq.data(), PYOPENCL_WAITLIST_ARGS, &evt)); PYOPENCL_RETURN_NEW_EVENT(evt); } #else // clEnqueueWaitForEvents + clEnqueueMarker is not equivalent // in the case of an out-of-order queue. 
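// Without clEnqueueMarkerWithWaitList (a CL 1.2 addition) there is no
// reliable way to hand back an event that respects the wait list on an
// out-of-order queue, so refuse instead of guessing: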
throw error("enqueue_nd_range_kernel", CL_INVALID_VALUE, "allow_empty_ndrange requires OpenCL 1.2"); #endif } PYOPENCL_RETRY_RETURN_IF_MEM_ERROR( { cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueNDRangeKernel, ( cq.data(), knl.data(), work_dim, global_work_offset_ptr, &global_work_size.front(), local_work_size_ptr, PYOPENCL_WAITLIST_ARGS, &evt )); PYOPENCL_RETURN_NEW_EVENT(evt); } ); } // }}} // {{{ gl interop inline bool have_gl() { #ifdef HAVE_GL return true; #else return false; #endif } #ifdef HAVE_GL #ifdef __APPLE__ inline cl_context_properties get_apple_cgl_share_group() { CGLContextObj kCGLContext = CGLGetCurrentContext(); CGLShareGroupObj kCGLShareGroup = CGLGetShareGroup(kCGLContext); return (cl_context_properties) kCGLShareGroup; } #endif /* __APPLE__ */ class gl_buffer : public memory_object { public: gl_buffer(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t()) : memory_object(mem, retain, std::move(hostbuf)) { } }; class gl_renderbuffer : public memory_object { public: gl_renderbuffer(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t()) : memory_object(mem, retain, std::move(hostbuf)) { } }; class gl_texture : public image { public: gl_texture(cl_mem mem, bool retain, hostbuf_t hostbuf=hostbuf_t()) : image(mem, retain, std::move(hostbuf)) { } py::object get_gl_texture_info(cl_gl_texture_info param_name) { switch (param_name) { case CL_GL_TEXTURE_TARGET: PYOPENCL_GET_TYPED_INFO(GLTexture, data(), param_name, GLenum); case CL_GL_MIPMAP_LEVEL: PYOPENCL_GET_TYPED_INFO(GLTexture, data(), param_name, GLint); default: throw error("MemoryObject.get_gl_texture_info", CL_INVALID_VALUE); } } }; #define PYOPENCL_WRAP_BUFFER_CREATOR(TYPE, NAME, CL_NAME, ARGS, CL_ARGS) \ inline \ void NAME ARGS \ { \ cl_int status_code; \ PYOPENCL_PRINT_CALL_TRACE(#CL_NAME); \ cl_mem mem = CL_NAME CL_ARGS; \ \ if (status_code != CL_SUCCESS) \ throw pyopencl::error(#CL_NAME, status_code); \ \ try \ { \ new (self) TYPE(mem, false); \ } \ catch (...) 
\ { \ PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem)); \ throw; \ } \ } PYOPENCL_WRAP_BUFFER_CREATOR(gl_buffer, create_from_gl_buffer, clCreateFromGLBuffer, (gl_buffer *self, context &ctx, cl_mem_flags flags, GLuint bufobj), (ctx.data(), flags, bufobj, &status_code)); PYOPENCL_WRAP_BUFFER_CREATOR(gl_texture, create_from_gl_texture_2d, clCreateFromGLTexture2D, (gl_texture *self, context &ctx, cl_mem_flags flags, GLenum texture_target, GLint miplevel, GLuint texture), (ctx.data(), flags, texture_target, miplevel, texture, &status_code)); PYOPENCL_WRAP_BUFFER_CREATOR(gl_texture, create_from_gl_texture_3d, clCreateFromGLTexture3D, (gl_texture *self, context &ctx, cl_mem_flags flags, GLenum texture_target, GLint miplevel, GLuint texture), (ctx.data(), flags, texture_target, miplevel, texture, &status_code)); PYOPENCL_WRAP_BUFFER_CREATOR(gl_renderbuffer, create_from_gl_renderbuffer, clCreateFromGLRenderbuffer, (gl_renderbuffer *self, context &ctx, cl_mem_flags flags, GLuint renderbuffer), (ctx.data(), flags, renderbuffer, &status_code)); inline void create_from_gl_texture( gl_texture *self, context &ctx, cl_mem_flags flags, GLenum texture_target, GLint miplevel, GLuint texture, unsigned dims) { if (dims == 2) return create_from_gl_texture_2d(self, ctx, flags, texture_target, miplevel, texture); else if (dims == 3) return create_from_gl_texture_3d(self, ctx, flags, texture_target, miplevel, texture); else throw pyopencl::error("Image", CL_INVALID_VALUE, "invalid dimension"); } inline py::tuple get_gl_object_info(memory_object_holder const &mem) { cl_gl_object_type otype; GLuint gl_name; PYOPENCL_CALL_GUARDED(clGetGLObjectInfo, (mem.data(), &otype, &gl_name)); return py::make_tuple(otype, gl_name); } #define WRAP_GL_ENQUEUE(what, What) \ inline \ event *enqueue_##what##_gl_objects( \ command_queue &cq, \ py::object py_mem_objects, \ py::object py_wait_for) \ { \ PYOPENCL_PARSE_WAIT_FOR; \ \ std::vector mem_objects; \ for (py::handle mo: py_mem_objects) \ mem_objects.push_back(py::cast(mo).data()); \ \ cl_event evt; \ PYOPENCL_CALL_GUARDED(clEnqueue##What##GLObjects, ( \ cq.data(), \ mem_objects.size(), mem_objects.empty( ) ? nullptr : &mem_objects.front(), \ PYOPENCL_WAITLIST_ARGS, &evt \ )); \ \ PYOPENCL_RETURN_NEW_EVENT(evt); \ } WRAP_GL_ENQUEUE(acquire, Acquire); WRAP_GL_ENQUEUE(release, Release); #endif #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) inline py::object get_gl_context_info_khr( py::object py_properties, cl_gl_context_info param_name, py::object py_platform ) { std::vector props = parse_context_properties(py_properties); typedef CL_API_ENTRY cl_int (CL_API_CALL *func_ptr_type)(const cl_context_properties * /* properties */, cl_gl_context_info /* param_name */, size_t /* param_value_size */, void * /* param_value */, size_t * /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0; func_ptr_type func_ptr; #if PYOPENCL_CL_VERSION >= 0x1020 if (py_platform.ptr() != Py_None) { platform &plat = py::cast(py_platform); func_ptr = (func_ptr_type) clGetExtensionFunctionAddressForPlatform( plat.data(), "clGetGLContextInfoKHR"); } else { PYOPENCL_DEPRECATED("get_gl_context_info_khr with platform=None", "2013.1", ); func_ptr = (func_ptr_type) clGetExtensionFunctionAddress( "clGetGLContextInfoKHR"); } #else func_ptr = (func_ptr_type) clGetExtensionFunctionAddress( "clGetGLContextInfoKHR"); #endif if (!func_ptr) throw error("Context.get_info", CL_INVALID_PLATFORM, "clGetGLContextInfoKHR extension function not present"); cl_context_properties *props_ptr = props.empty( ) ? 
nullptr : &props.front(); switch (param_name) { case CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR: { cl_device_id param_value; PYOPENCL_CALL_GUARDED(func_ptr, (props_ptr, param_name, sizeof(param_value), ¶m_value, 0)); return py::object(handle_from_new_ptr( \ new device(param_value, /*retain*/ true))); } case CL_DEVICES_FOR_GL_CONTEXT_KHR: { size_t size; PYOPENCL_CALL_GUARDED(func_ptr, (props_ptr, param_name, 0, 0, &size)); std::vector devices; devices.resize(size / sizeof(devices.front())); PYOPENCL_CALL_GUARDED(func_ptr, (props_ptr, param_name, size, devices.empty( ) ? nullptr : &devices.front(), &size)); py::list result; for (cl_device_id did: devices) result.append(handle_from_new_ptr( new device(did))); return result; } default: throw error("get_gl_context_info_khr", CL_INVALID_VALUE); } } #endif // }}} // {{{ deferred implementation bits #if PYOPENCL_CL_VERSION >= 0x2010 inline void context::set_default_device_command_queue(device const &dev, command_queue const &queue) { PYOPENCL_CALL_GUARDED(clSetDefaultDeviceCommandQueue, (m_context, dev.data(), queue.data())); } #endif inline program *error::get_program() const { return new program(m_program, /* retain */ true); } inline py::object create_mem_object_wrapper(cl_mem mem, bool retain=true) { cl_mem_object_type mem_obj_type; PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, \ (mem, CL_MEM_TYPE, sizeof(mem_obj_type), &mem_obj_type, 0)); switch (mem_obj_type) { case CL_MEM_OBJECT_BUFFER: return py::object(handle_from_new_ptr( new buffer(mem, retain))); case CL_MEM_OBJECT_IMAGE2D: case CL_MEM_OBJECT_IMAGE3D: #if PYOPENCL_CL_VERSION >= 0x1020 case CL_MEM_OBJECT_IMAGE2D_ARRAY: case CL_MEM_OBJECT_IMAGE1D: case CL_MEM_OBJECT_IMAGE1D_ARRAY: case CL_MEM_OBJECT_IMAGE1D_BUFFER: #endif return py::object(handle_from_new_ptr( new image(mem, retain))); default: return py::object(handle_from_new_ptr( new memory_object(mem, retain))); } } inline py::object memory_object_from_int(intptr_t cl_mem_as_int, bool retain) { return create_mem_object_wrapper((cl_mem) cl_mem_as_int, retain); } inline py::object memory_object_holder::get_info(cl_mem_info param_name) const { switch (param_name) { case CL_MEM_TYPE: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, cl_mem_object_type); case CL_MEM_FLAGS: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, cl_mem_flags); case CL_MEM_SIZE: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, size_t); case CL_MEM_HOST_PTR: throw pyopencl::error("MemoryObject.get_info", CL_INVALID_VALUE, "Use MemoryObject.get_host_array to get host pointer."); case CL_MEM_MAP_COUNT: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, cl_uint); case CL_MEM_REFERENCE_COUNT: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, cl_uint); case CL_MEM_CONTEXT: PYOPENCL_GET_OPAQUE_INFO(MemObject, data(), param_name, cl_context, context); #if PYOPENCL_CL_VERSION >= 0x1010 case CL_MEM_ASSOCIATED_MEMOBJECT: { cl_mem param_value; PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, \ (data(), param_name, sizeof(param_value), ¶m_value, 0)); if (param_value == 0) { // no associated memory object? no problem. 
return py::none(); } return create_mem_object_wrapper(param_value); } case CL_MEM_OFFSET: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, size_t); #endif #if PYOPENCL_CL_VERSION >= 0x2000 case CL_MEM_USES_SVM_POINTER: PYOPENCL_GET_TYPED_INFO(MemObject, data(), param_name, cl_bool); #endif #if PYOPENCL_CL_VERSION >= 0x3000 case CL_MEM_PROPERTIES: { std::vector result; PYOPENCL_GET_VEC_INFO(MemObject, data(), param_name, result); PYOPENCL_RETURN_VECTOR(cl_mem_properties, result); } #endif default: throw error("MemoryObjectHolder.get_info", CL_INVALID_VALUE); } } // FIXME: Reenable in pypy #ifndef PYPY_VERSION inline py::object get_mem_obj_host_array( py::object mem_obj_py, py::object shape, py::object dtype, py::object order_py) { memory_object_holder const &mem_obj = py::cast(mem_obj_py); PyArray_Descr *tp_descr; if (PyArray_DescrConverter(dtype.ptr(), &tp_descr) != NPY_SUCCEED) throw py::python_error(); cl_mem_flags mem_flags; PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, (mem_obj.data(), CL_MEM_FLAGS, sizeof(mem_flags), &mem_flags, 0)); if (!(mem_flags & CL_MEM_USE_HOST_PTR)) throw pyopencl::error("MemoryObject.get_host_array", CL_INVALID_VALUE, "Only MemoryObject with USE_HOST_PTR " "is supported."); std::vector dims; try { dims.push_back(py::cast(shape)); } catch (py::cast_error &) { for (auto it: shape) dims.push_back(py::cast(it)); } NPY_ORDER order = NPY_CORDER; PyArray_OrderConverter(order_py.ptr(), &order); int ary_flags = 0; if (order == NPY_FORTRANORDER) ary_flags |= NPY_FARRAY; else if (order == NPY_CORDER) ary_flags |= NPY_CARRAY; else throw std::runtime_error("unrecognized order specifier"); void *host_ptr; size_t mem_obj_size; PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, (mem_obj.data(), CL_MEM_HOST_PTR, sizeof(host_ptr), &host_ptr, 0)); PYOPENCL_CALL_GUARDED(clGetMemObjectInfo, (mem_obj.data(), CL_MEM_SIZE, sizeof(mem_obj_size), &mem_obj_size, 0)); py::object result = py::steal(PyArray_NewFromDescr( &PyArray_Type, tp_descr, dims.size(), &dims.front(), /*strides*/ nullptr, host_ptr, ary_flags, /*obj*/nullptr)); PyArrayObject *result_arr = (PyArrayObject *) result.ptr(); if ((size_t) PyArray_NBYTES(result_arr) > mem_obj_size) throw pyopencl::error("MemoryObject.get_host_array", CL_INVALID_VALUE, "Resulting array is larger than memory object."); PyArray_SetBaseObject(result_arr, mem_obj_py.ptr()); Py_INCREF(mem_obj_py.ptr()); return result; } #endif // }}} } #endif // vim: foldmethod=marker pyopencl-2025.1/src/wrap_cl_part_1.cpp0000644000000000000000000002566414332717401014571 0ustar00// Wrap CL // // Copyright (C) 2009-18 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. #define NO_IMPORT_ARRAY #define PY_ARRAY_UNIQUE_SYMBOL pyopencl_ARRAY_API #include "wrap_cl.hpp" using namespace pyopencl; void pyopencl_expose_part_1(py::module_ &m) { m.def("get_cl_header_version", get_cl_header_version); m.def("_sizeof_size_t", [](){ return sizeof(size_t); }); // {{{ platform DEF_SIMPLE_FUNCTION(get_platforms); { typedef platform cls; py::class_(m, "Platform", py::dynamic_attr()) .DEF_SIMPLE_METHOD(get_info) .def("get_devices", &cls::get_devices, py::arg("device_type")=CL_DEVICE_TYPE_ALL) .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_EQUALITY_TESTS PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_platform_id) ; } // }}} // {{{ device { typedef device cls; py::class_(m, "Device", py::dynamic_attr()) .DEF_SIMPLE_METHOD(get_info) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) #if PYOPENCL_CL_VERSION >= 0x1020 .DEF_SIMPLE_METHOD(create_sub_devices) #endif PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_device_id) #if PYOPENCL_CL_VERSION >= 0x2010 .DEF_SIMPLE_METHOD(device_and_host_timer) .DEF_SIMPLE_METHOD(host_timer) #endif ; } // }}} // {{{ context { typedef context cls; py::class_( m, "Context", py::dynamic_attr(), py::is_weak_referenceable(), py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ) .def( "__init__", [](cls *self, py::object py_devices, py::object py_properties, py::object py_dev_type) { PYOPENCL_RETRY_IF_MEM_ERROR( create_context_inner( self, py_devices, py_properties, py_dev_type); ) }, py::arg("devices").none(true)=py::none(), py::arg("properties").none(true)=py::none(), py::arg("dev_type").none(true)=py::none() ) .DEF_SIMPLE_METHOD(get_info) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_context) #if PYOPENCL_CL_VERSION >= 0x2010 .DEF_SIMPLE_METHOD(set_default_device_command_queue) #endif ; } // }}} // {{{ command queue { typedef command_queue cls; py::class_( m, "CommandQueue", py::dynamic_attr(), py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ) .def( py::init(), py::arg("context"), py::arg("device").none(true)=py::none(), py::arg("properties")=py::cast(0)) .def("_finalize", &cls::finalize) .DEF_SIMPLE_METHOD(get_info) #if PYOPENCL_CL_VERSION < 0x1010 .DEF_SIMPLE_METHOD(set_property) #endif .DEF_SIMPLE_METHOD(flush) .DEF_SIMPLE_METHOD(finish) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_command_queue) ; } // }}} // {{{ events/synchronization { typedef event cls; py::class_(m, "Event") .DEF_SIMPLE_METHOD(get_info) .DEF_SIMPLE_METHOD(get_profiling_info) .DEF_SIMPLE_METHOD(wait) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_event) #if PYOPENCL_CL_VERSION >= 0x1010 .DEF_SIMPLE_METHOD(set_callback) #endif ; } { typedef nanny_event cls; py::class_(m, "NannyEvent", py::dynamic_attr()) .DEF_SIMPLE_METHOD(get_ward) ; } DEF_SIMPLE_FUNCTION(wait_for_events); #if PYOPENCL_CL_VERSION >= 0x1020 m.def("_enqueue_marker_with_wait_list", enqueue_marker_with_wait_list, py::arg("queue"), py::arg("wait_for").none(true)=py::none() ); #endif m.def("_enqueue_marker", enqueue_marker, py::arg("queue") ); m.def("_enqueue_wait_for_events", enqueue_wait_for_events, py::arg("queue"), py::arg("wait_for").none(true)=py::none()); #if 
PYOPENCL_CL_VERSION >= 0x1020 m.def("_enqueue_barrier_with_wait_list", enqueue_barrier_with_wait_list, py::arg("queue"), py::arg("wait_for").none(true)=py::none() ); #endif m.def("_enqueue_barrier", enqueue_barrier, py::arg("queue")); #if PYOPENCL_CL_VERSION >= 0x1010 { typedef user_event cls; py::class_(m, "UserEvent", py::dynamic_attr()) .def("__init__", [](cls *self, context &ctx) { create_user_event(self, ctx); }, py::arg("context")) .DEF_SIMPLE_METHOD(set_status) ; } #endif // }}} // {{{ memory_object { typedef memory_object_holder cls; py::class_(m, "MemoryObjectHolder", py::dynamic_attr()) .DEF_SIMPLE_METHOD(get_info) // FIXME: Reenable in pypy #ifndef PYPY_VERSION .def("get_host_array", get_mem_obj_host_array, py::arg("shape"), py::arg("dtype"), py::arg("order")="C") #endif PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) .def_prop_ro("int_ptr", to_int_ptr, "Return an integer corresponding to the pointer value " "of the underlying :c:type:`cl_mem`. " "Use :meth:`from_int_ptr` to turn back into a Python object." "\n\n.. versionadded:: 2013.2\n") ; } { typedef memory_object cls; py::class_(m, "MemoryObject", py::dynamic_attr()) .DEF_SIMPLE_METHOD(release) .def_prop_ro("hostbuf", &cls::hostbuf) .def_static("from_int_ptr", memory_object_from_int, "(static method) Return a new Python object referencing the C-level " ":c:type:`cl_mem` object at the location pointed to " "by *int_ptr_value*. The relevant ``clRetain*`` function " "will be called if *retain* is True." "If the previous owner of the object will *not* release the reference, " "*retain* should be set to *False*, to effectively transfer ownership to " ":mod:`pyopencl`." "\n\n.. versionadded:: 2013.2\n" "\n\n.. versionchanged:: 2016.1\n\n *retain* added.", py::arg("int_ptr_value"), py::arg("retain")=true) ; } #if PYOPENCL_CL_VERSION >= 0x1020 m.def("enqueue_migrate_mem_objects", enqueue_migrate_mem_objects, py::arg("queue"), py::arg("mem_objects"), py::arg("flags")=0, py::arg("wait_for").none(true)=py::none() ); #endif // }}} // {{{ buffer { typedef buffer cls; py::class_(m, "Buffer", py::dynamic_attr()) .def( "__init__", [](cls *self, context &ctx, cl_mem_flags flags, size_t size, py::object py_hostbuf) { create_buffer_py(self, ctx, flags, size, py_hostbuf); }, py::arg("context"), py::arg("flags"), py::arg("size")=0, py::arg("hostbuf").none(true)=py::none() ) #if PYOPENCL_CL_VERSION >= 0x1010 .def("get_sub_region", &cls::get_sub_region, py::arg("origin"), py::arg("size"), py::arg("flags")=0 ) .def("__getitem__", &cls::getitem) #endif ; } // }}} // {{{ transfers // {{{ byte-for-byte m.def("_enqueue_read_buffer", enqueue_read_buffer, py::arg("queue"), py::arg("mem"), py::arg("hostbuf"), py::arg("src_offset")=0, py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_write_buffer", enqueue_write_buffer, py::arg("queue"), py::arg("mem"), py::arg("hostbuf"), py::arg("dst_offset")=0, py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_copy_buffer", enqueue_copy_buffer, py::arg("queue"), py::arg("src"), py::arg("dst"), py::arg("byte_count")=-1, py::arg("src_offset")=0, py::arg("dst_offset")=0, py::arg("wait_for").none(true)=py::none() ); #ifdef CL_DEVICE_P2P_DEVICES_AMD m.def("enqueue_copy_buffer_p2p_amd", enqueue_copy_buffer_p2p_amd, py::arg("platform"), py::arg("queue"), py::arg("src"), py::arg("dst"), py::arg("byte_count").none(true)=py::none(), py::arg("wait_for").none(true)=py::none() ); #endif // }}} // {{{ rectangular #if PYOPENCL_CL_VERSION >= 
0x1010 m.def("_enqueue_read_buffer_rect", enqueue_read_buffer_rect, py::arg("queue"), py::arg("mem"), py::arg("hostbuf"), py::arg("buffer_origin"), py::arg("host_origin"), py::arg("region"), py::arg("buffer_pitches").none(true)=py::none(), py::arg("host_pitches").none(true)=py::none(), py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_write_buffer_rect", enqueue_write_buffer_rect, py::arg("queue"), py::arg("mem"), py::arg("hostbuf"), py::arg("buffer_origin"), py::arg("host_origin"), py::arg("region"), py::arg("buffer_pitches").none(true)=py::none(), py::arg("host_pitches").none(true)=py::none(), py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_copy_buffer_rect", enqueue_copy_buffer_rect, py::arg("queue"), py::arg("src"), py::arg("dst"), py::arg("src_origin"), py::arg("dst_origin"), py::arg("region"), py::arg("src_pitches").none(true)=py::none(), py::arg("dst_pitches").none(true)=py::none(), py::arg("wait_for").none(true)=py::none() ); #endif // }}} // }}} #if PYOPENCL_CL_VERSION >= 0x1020 m.def("_enqueue_fill_buffer", enqueue_fill_buffer, py::arg("queue"), py::arg("mem"), py::arg("pattern"), py::arg("offset"), py::arg("size"), py::arg("wait_for").none(true)=py::none()); #endif } // vim: foldmethod=marker pyopencl-2025.1/src/wrap_cl_part_2.cpp0000644000000000000000000005144714332717401014570 0ustar00// Wrap CL // // Copyright (C) 2009-18 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. 
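// This translation unit exposes the image, pipe, memory-map, SVM, sampler,
// program, kernel and (optionally) GL-interop wrappers from wrap_cl.hpp to
// Python; see pyopencl_expose_part_2 below.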
#include #define NO_IMPORT_ARRAY #define PY_ARRAY_UNIQUE_SYMBOL pyopencl_ARRAY_API #include "wrap_cl.hpp" namespace pyopencl { #if PYOPENCL_CL_VERSION >= 0x1020 py::object image_desc_dummy_getter(cl_image_desc &desc) { return py::none(); } void image_desc_set_shape(cl_image_desc &desc, py::object py_shape) { COPY_PY_REGION_TRIPLE(shape); desc.image_width = shape[0]; desc.image_height = shape[1]; desc.image_depth = shape[2]; desc.image_array_size = shape[2]; } void image_desc_set_pitches(cl_image_desc &desc, py::object py_pitches) { COPY_PY_PITCH_TUPLE(pitches); desc.image_row_pitch = pitches[0]; desc.image_slice_pitch = pitches[1]; } void image_desc_set_buffer(cl_image_desc &desc, memory_object *mobj) { if (mobj) desc.buffer = mobj->data(); else desc.buffer = 0; } #endif } using namespace pyopencl; static PyCFunctionWithKeywords dummy_init = [](PyObject *, PyObject *, PyObject *) -> PyObject * { PyErr_SetString(PyExc_RuntimeError, "This should never be called!"); return nullptr; }; static PyType_Slot init_slots[] { // the presence of this slot enables normal object construction via __init__ and __new__ // instead of an optimized codepath within nanobind that skips these. That in turn // makes it possible to intercept calls and implement custom logic. { Py_tp_init, (void *) dummy_init }, { 0, nullptr } }; void pyopencl_expose_part_2(py::module_ &m) { // {{{ image #if PYOPENCL_CL_VERSION >= 0x1020 { typedef cl_image_desc cls; py::class_(m, "ImageDescriptor") .def(py::init<>()) .def_rw("image_type", &cls::image_type) .def_prop_rw("shape", &image_desc_dummy_getter, image_desc_set_shape) .def_rw("array_size", &cls::image_array_size) .def_prop_rw("pitches", &image_desc_dummy_getter, image_desc_set_pitches) .def_rw("num_mip_levels", &cls::num_mip_levels) .def_rw("num_samples", &cls::num_samples) .def_prop_rw("buffer", &image_desc_dummy_getter, image_desc_set_buffer, py::arg("buffer").none() ) ; } #endif { typedef image cls; // https://github.com/wjakob/nanobind/issues/750 py::class_(m, "Image", py::dynamic_attr(), py::type_slots(init_slots)) .def_static( "_custom_init", []( py::handle_t h, context const &ctx, cl_mem_flags flags, cl_image_format const &fmt, py::sequence shape, py::sequence pitches, py::object buffer) { if (py::inst_ready(h)) py::raise_type_error("Image is already initialized!"); image *self = py::inst_ptr(h); create_image(self, ctx, flags, fmt, shape, pitches, buffer); py::inst_mark_ready(h); }, py::arg("h"), py::arg("context"), py::arg("flags"), py::arg("format"), py::arg("shape")=py::none(), py::arg("pitches")=py::none(), py::arg("hostbuf")=py::none() ) #if PYOPENCL_CL_VERSION >= 0x1020 .def_static( "_custom_init", []( py::handle_t h, context const &ctx, cl_mem_flags flags, cl_image_format const &fmt, cl_image_desc &desc, py::object buffer) { if (py::inst_ready(h)) py::raise_type_error("Image is already initialized!"); image *self = py::inst_ptr(h); create_image_from_desc(self, ctx, flags, fmt, desc, buffer); py::inst_mark_ready(h); }, py::arg("h"), py::arg("context"), py::arg("flags"), py::arg("format"), py::arg("desc"), py::arg("hostbuf")=py::none() ) #endif .DEF_SIMPLE_METHOD(get_image_info) ; } { typedef cl_image_format cls; py::class_(m, "ImageFormat") .def( "__init__", [](cls *self, cl_channel_order ord, cl_channel_type tp) { set_image_format(self, ord, tp); }) .def_rw("channel_order", &cls::image_channel_order) .def_rw("channel_data_type", &cls::image_channel_data_type) .def_prop_ro("channel_count", &get_image_format_channel_count) .def_prop_ro("dtype_size", 
&get_image_format_channel_dtype_size) .def_prop_ro("itemsize", &get_image_format_item_size) ; } DEF_SIMPLE_FUNCTION(get_supported_image_formats); m.def("_enqueue_read_image", enqueue_read_image, py::arg("queue"), py::arg("mem"), py::arg("origin"), py::arg("region"), py::arg("hostbuf"), py::arg("row_pitch")=0, py::arg("slice_pitch")=0, py::arg("wait_for")=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_write_image", enqueue_write_image, py::arg("queue"), py::arg("mem"), py::arg("origin"), py::arg("region"), py::arg("hostbuf"), py::arg("row_pitch")=0, py::arg("slice_pitch")=0, py::arg("wait_for")=py::none(), py::arg("is_blocking")=true ); m.def("_enqueue_copy_image", enqueue_copy_image, py::arg("queue"), py::arg("src"), py::arg("dest"), py::arg("src_origin"), py::arg("dest_origin"), py::arg("region"), py::arg("wait_for")=py::none() ); m.def("_enqueue_copy_image_to_buffer", enqueue_copy_image_to_buffer, py::arg("queue"), py::arg("src"), py::arg("dest"), py::arg("origin"), py::arg("region"), py::arg("offset"), py::arg("wait_for")=py::none() ); m.def("_enqueue_copy_buffer_to_image", enqueue_copy_buffer_to_image, py::arg("queue"), py::arg("src"), py::arg("dest"), py::arg("offset"), py::arg("origin"), py::arg("region"), py::arg("wait_for")=py::none() ); #if PYOPENCL_CL_VERSION >= 0x1020 m.def("enqueue_fill_image", enqueue_fill_image, py::arg("queue"), py::arg("mem"), py::arg("color"), py::arg("origin"), py::arg("region"), py::arg("wait_for")=py::none() ); #endif // }}} // {{{ pipe { typedef pyopencl::pipe cls; py::class_(m, "Pipe", py::dynamic_attr()) #if PYOPENCL_CL_VERSION >= 0x2000 .def( "__init__", []( cls *self, context const &ctx, cl_mem_flags flags, cl_uint pipe_packet_size, cl_uint pipe_max_packets, py::sequence py_props) { create_pipe(self, ctx, flags, pipe_packet_size, pipe_max_packets, py_props); }, py::arg("context"), py::arg("flags"), py::arg("packet_size"), py::arg("max_packets"), py::arg("properties")=py::make_tuple() ) #endif .DEF_SIMPLE_METHOD(get_pipe_info) ; } // }}} // {{{ memory_map { typedef memory_map cls; py::class_(m, "MemoryMap", py::dynamic_attr()) .def("release", &cls::release, py::arg("queue").none(true)=nullptr, py::arg("wait_for").none(true)=py::none() ) ; } // FIXME: Reenable in pypy #ifndef PYPY_VERSION m.def("enqueue_map_buffer", enqueue_map_buffer, py::arg("queue"), py::arg("buf"), py::arg("flags"), py::arg("offset"), py::arg("shape"), py::arg("dtype"), py::arg("order")="C", py::arg("strides").none(true)=py::none(), py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true); m.def("enqueue_map_image", enqueue_map_image, py::arg("queue"), py::arg("img"), py::arg("flags"), py::arg("origin"), py::arg("region"), py::arg("shape"), py::arg("dtype"), py::arg("order")="C", py::arg("strides").none(true)=py::none(), py::arg("wait_for").none(true)=py::none(), py::arg("is_blocking")=true); #endif // }}} // {{{ svm_pointer #if PYOPENCL_CL_VERSION >= 0x2000 { typedef svm_pointer cls; py::class_(m, "SVMPointer", py::dynamic_attr()) // For consistency, it may seem appropriate to use int_ptr here, but // that would work on both buffers and SVM, and passing a buffer pointer to // a kernel is going to lead to a bad time. 
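// A hedged Python-side sketch of how SVM objects travel to kernels, using
// only constructors bound in this file (size, alignment, and the kernel
// launch shape are hypothetical):
//
//   svm = cl.SVMAllocation(ctx, 1024, 64, cl.svm_mem_flags.READ_WRITE)
//   knl(queue, (n,), None, svm)   # SVM pointers pass straight to set_arg_svm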
.def_prop_ro("svm_ptr", [](cls &self) { return (intptr_t) self.svm_ptr(); }) .def_prop_ro("size", [](cls &self) -> py::object { try { return py::cast(self.size()); } catch (size_not_available) { return py::none(); } }) .def_prop_ro("buf", [](cls &self) -> py::ndarray> { size_t size; try { size = self.size(); } catch (size_not_available) { throw pyopencl::error("SVMPointer buffer protocol", CL_INVALID_VALUE, "size of SVM is not known"); } return py::ndarray>( /* data = */ self.svm_ptr(), /* ndim = */ 1, /* shape pointer = */ &size, /* owner = */ py::handle()); }, py::rv_policy::reference_internal) ; } // }}} // {{{ svm_arg_wrapper { typedef svm_arg_wrapper cls; py::class_(m, "SVM", py::dynamic_attr()) .def(py::init()) .def_prop_ro("mem", &cls::mem) ; } // }}} // {{{ svm_allocation { typedef svm_allocation cls; py::class_(m, "SVMAllocation", py::dynamic_attr()) .def(py::init, size_t, cl_uint, cl_svm_mem_flags, const command_queue *>(), py::arg("context"), py::arg("size"), py::arg("alignment"), py::arg("flags"), py::arg("queue").none(true)=py::none() ) .DEF_SIMPLE_METHOD(release) .def("enqueue_release", &cls::enqueue_release, ":returns: a :class:`pyopencl.Event`\n\n" "|std-enqueue-blurb|", py::arg("queue").none(true)=py::none(), py::arg("wait_for").none(true)=py::none() ) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", [](cls &self) { return (intptr_t) self.svm_ptr(); }) .def("bind_to_queue", &cls::bind_to_queue, py::arg("queue")) .DEF_SIMPLE_METHOD(unbind_from_queue) // only for diagnostic/debugging/testing purposes! .def_prop_ro("_queue", [](cls const &self) -> py::object { cl_command_queue queue = self.queue(); if (queue) return py::cast(new command_queue(queue, true)); else return py::none(); }) ; } // }}} // {{{ svm operations m.def("_enqueue_svm_memcpy", enqueue_svm_memcpy, py::arg("queue"), py::arg("is_blocking"), py::arg("dst"), py::arg("src"), py::arg("wait_for").none(true)=py::none(), py::arg("byte_count").none(true)=py::none() ); m.def("_enqueue_svm_memfill", enqueue_svm_memfill, py::arg("queue"), py::arg("dst"), py::arg("pattern"), py::arg("byte_count").none(true)=py::none(), py::arg("wait_for").none(true)=py::none() ); m.def("_enqueue_svm_map", enqueue_svm_map, py::arg("queue"), py::arg("is_blocking"), py::arg("flags"), py::arg("svm"), py::arg("wait_for").none(true)=py::none(), py::arg("size").none(true)=py::none() ); m.def("_enqueue_svm_unmap", enqueue_svm_unmap, py::arg("queue"), py::arg("svm"), py::arg("wait_for").none(true)=py::none() ); #endif #if PYOPENCL_CL_VERSION >= 0x2010 m.def("_enqueue_svm_migrate_mem", enqueue_svm_migratemem, py::arg("queue"), py::arg("svms"), py::arg("flags").none(true)=py::none(), py::arg("wait_for").none(true)=py::none() ); #endif // }}} // {{{ sampler { typedef sampler cls; py::class_(m, "Sampler", py::dynamic_attr()) #if PYOPENCL_CL_VERSION >= 0x2000 .def(py::init()) #endif .def(py::init()) .DEF_SIMPLE_METHOD(get_info) PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_sampler) ; } // }}} // {{{ program { typedef program cls; py::enum_(m, "program_kind") .value("UNKNOWN", cls::KND_UNKNOWN) .value("SOURCE", cls::KND_SOURCE) .value("BINARY", cls::KND_BINARY) .value("IL", cls::KND_IL) ; py::class_(m, "_Program", py::dynamic_attr()) .def( "__init__", [](cls *self, context &ctx, std::string const &src) { create_program_with_source(self, ctx, src); }, py::arg("context"), py::arg("src")) .def( "__init__", [](cls *self, context &ctx, py::sequence devices, py::sequence binaries) { return 
create_program_with_binary(self, ctx, devices, binaries); }, py::arg("context"), py::arg("devices"), py::arg("binaries")) #if (PYOPENCL_CL_VERSION >= 0x1020) || \ ((PYOPENCL_CL_VERSION >= 0x1030) && defined(__APPLE__)) .def_static("create_with_built_in_kernels", create_program_with_built_in_kernels, py::arg("context"), py::arg("devices"), py::arg("kernel_names")) #endif .DEF_SIMPLE_METHOD(kind) .DEF_SIMPLE_METHOD(get_info) .DEF_SIMPLE_METHOD(get_build_info) .def("_build", &cls::build, py::arg("options")="", py::arg("devices").none(true)=py::none()) #if PYOPENCL_CL_VERSION >= 0x1020 .def("compile", &cls::compile, py::arg("options")="", py::arg("devices").none(true)=py::none(), py::arg("headers")=py::list()) .def_static("link", &link_program, py::arg("context"), py::arg("programs"), py::arg("options")="", py::arg("devices").none(true)=py::none() ) #endif #if PYOPENCL_CL_VERSION >= 0x2020 .def("set_specialization_constant", &cls::set_specialization_constant, py::arg("spec_id"), py::arg("buffer")) #endif PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) .def("all_kernels", create_kernels_in_program) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_program) ; } #if (PYOPENCL_CL_VERSION >= 0x2010) m.def("_create_program_with_il", create_program_with_il); #endif #if PYOPENCL_CL_VERSION >= 0x1020 m.def("unload_platform_compiler", unload_platform_compiler); #endif // }}} // {{{ kernel { typedef kernel cls; py::class_(m, "Kernel", py::dynamic_attr()) .def(py::init()) .def_prop_ro("_source", &cls::source) .DEF_SIMPLE_METHOD(get_info) .DEF_SIMPLE_METHOD(get_work_group_info) #if PYOPENCL_CL_VERSION >= 0x2010 .DEF_SIMPLE_METHOD(clone) #endif .def("_set_arg_null", &cls::set_arg_null) .def("_set_arg_buf", &cls::set_arg_buf) #if PYOPENCL_CL_VERSION >= 0x2000 .def("_set_arg_svm", &cls::set_arg_svm) #endif .def("_set_arg_multi", [](cls &knl, py::tuple indices_and_args) { set_arg_multi( [&](cl_uint i, py::handle arg) { knl.set_arg(i, arg); }, indices_and_args); }) .def("_set_arg_buf_multi", [](cls &knl, py::tuple indices_and_args) { set_arg_multi( [&](cl_uint i, py::handle arg) { knl.set_arg_buf(i, arg); }, indices_and_args); }) .def("_set_arg_buf_pack_multi", [](cls &knl, py::tuple indices_chars_and_args) { set_arg_multi( [&](cl_uint i, py::handle typechar, py::handle arg) { knl.set_arg_buf_pack(i, typechar, arg); }, indices_chars_and_args); }) .DEF_SIMPLE_METHOD(set_arg) #if PYOPENCL_CL_VERSION >= 0x1020 .DEF_SIMPLE_METHOD(get_arg_info) #endif PYOPENCL_EXPOSE_EQUALITY_TESTS .def("__hash__", &cls::hash) PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_kernel) #if PYOPENCL_CL_VERSION >= 0x2010 .def("get_sub_group_info", &cls::get_sub_group_info, py::arg("device"), py::arg("param"), py::arg("input_value").none(true)=py::none() ) #endif .def("__call__", &cls::enqueue) .def("set_args", &cls::set_args) .def("_set_enqueue_and_set_args", &cls::set_enqueue_and_set_args) ; } { typedef local_memory cls; py::class_(m, "LocalMemory", py::dynamic_attr()) .def( py::init(), py::arg("size")) .def_prop_ro("size", &cls::size) ; } m.def("enqueue_nd_range_kernel", enqueue_nd_range_kernel, py::arg("queue"), py::arg("kernel"), py::arg("global_work_size"), py::arg("local_work_size").none(true), py::arg("global_work_offset").none(true)=py::none(), py::arg("wait_for").none(true)=py::none(), py::arg("g_times_l")=false, py::arg("allow_empty_ndrange")=false ); // TODO: clEnqueueNativeKernel // }}} // {{{ GL interop DEF_SIMPLE_FUNCTION(have_gl); #ifdef HAVE_GL #ifdef __APPLE__ DEF_SIMPLE_FUNCTION(get_apple_cgl_share_group); #endif /* __APPLE__ */ { 
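// The classes below wrap existing GL objects. Before kernels touch them,
// they must be acquired on the queue and released afterwards; a hedged
// sketch using the enqueue_{acquire,release}_gl_objects bindings further
// down:
//
//   cl.enqueue_acquire_gl_objects(queue, [glbuf])
//   # ... enqueue kernels that read/write glbuf ...
//   cl.enqueue_release_gl_objects(queue, [glbuf])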
typedef gl_buffer cls; py::class_(m, "GLBuffer", py::dynamic_attr()) .def( "__init__", [](cls *self, context &ctx, cl_mem_flags flags, GLuint bufobj) { create_from_gl_buffer(self, ctx, flags, bufobj); }, py::arg("context"), py::arg("flags"), py::arg("bufobj")) .def("get_gl_object_info", get_gl_object_info) ; } { typedef gl_renderbuffer cls; py::class_(m, "GLRenderBuffer", py::dynamic_attr()) .def( "__init__", [](cls *self, context &ctx, cl_mem_flags flags, GLuint bufobj) { create_from_gl_renderbuffer(self, ctx, flags, bufobj); }, py::arg("context"), py::arg("flags"), py::arg("bufobj")) .def("get_gl_object_info", get_gl_object_info) ; } { typedef gl_texture cls; py::class_(m, "GLTexture", py::dynamic_attr()) .def( "__init__", [](cls *self, context &ctx, cl_mem_flags flags, GLenum texture_target, GLint miplevel, GLuint texture, unsigned dims) { create_from_gl_texture(self, ctx, flags, texture_target, miplevel, texture, dims); }, py::arg("context"), py::arg("flags"), py::arg("texture_target"), py::arg("miplevel"), py::arg("texture"), py::arg("dims")) .def("get_gl_object_info", get_gl_object_info) .DEF_SIMPLE_METHOD(get_gl_texture_info) ; } m.def("enqueue_acquire_gl_objects", enqueue_acquire_gl_objects, py::arg("queue"), py::arg("mem_objects"), py::arg("wait_for").none(true)=py::none() ); m.def("enqueue_release_gl_objects", enqueue_release_gl_objects, py::arg("queue"), py::arg("mem_objects"), py::arg("wait_for").none(true)=py::none() ); #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) m.def("get_gl_context_info_khr", get_gl_context_info_khr, py::arg("properties"), py::arg("param_name"), py::arg("platform").none(true)=py::none() ); #endif #endif // }}} } // vim: foldmethod=marker pyopencl-2025.1/src/wrap_constants.cpp0000644000000000000000000011314014332717401014724 0ustar00// Wrap CL constants and errors // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. 
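// For orientation, a hypothetical Python-side sketch (not from this file) of
// how the constant scopes and exception hierarchy defined below are
// typically consumed, using public pyopencl entry points:
//
//   import pyopencl as cl
//
//   ctx = cl.create_some_context()
//   try:
//       queue = cl.CommandQueue(
//           ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
//   except cl.LogicError:
//       # The translator below maps codes <= CL_INVALID_VALUE
//       # (e.g. CL_INVALID_QUEUE_PROPERTIES) to LogicError.
//       queue = cl.CommandQueue(ctx)
//
// Per the translator below: CL_MEM_OBJECT_ALLOCATION_FAILURE becomes
// MemoryError, codes between CL_INVALID_VALUE and CL_SUCCESS become
// RuntimeError, and anything else falls back to the Error base class.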
#define NO_IMPORT_ARRAY #define PY_ARRAY_UNIQUE_SYMBOL pyopencl_ARRAY_API #include "wrap_cl.hpp" using namespace pyopencl; namespace { // {{{ 'fake' constant scopes class status_code { }; class platform_info { }; class device_type { }; class device_info { }; class device_topology_type_amd { }; class device_fp_config { }; class device_mem_cache_type { }; class device_local_mem_type { }; class device_exec_capabilities { }; class device_svm_capabilities { }; class command_queue_properties { }; class context_info { }; class gl_context_info { }; class context_properties { }; class command_queue_info { }; class queue_properties { }; class mem_flags { }; class svm_mem_flags { }; class channel_order { }; class channel_type { }; class mem_object_type { }; class mem_info { }; class image_info { }; class pipe_info { }; class pipe_properties { }; class addressing_mode { }; class filter_mode { }; class sampler_info { }; class sampler_properties { }; class map_flags { }; class program_info { }; class program_build_info { }; class program_binary_type { }; class build_status { }; class kernel_info { }; class kernel_arg_info { }; class kernel_arg_address_qualifier { }; class kernel_arg_access_qualifier { }; class kernel_arg_type_qualifier { }; class kernel_work_group_info { }; class kernel_sub_group_info { }; class event_info { }; class command_type { }; class command_execution_status { }; class profiling_info { }; class buffer_create_type { }; class mem_migration_flags { }; class device_partition_property { }; class device_affinity_domain { }; class device_atomic_capabilities { }; class device_device_enqueue_capabilities { }; class version_bits { }; class khronos_vendor_id { }; class gl_object_type { }; class gl_texture_info { }; // }}} } void pyopencl_expose_constants(py::module_ &m) { // {{{ exceptions { #define DECLARE_EXC(NAME, BASE) \ static py::exception CL##NAME(m, #NAME, BASE); DECLARE_EXC(Error, PyExc_Exception); DECLARE_EXC(MemoryError, CLError.ptr()); DECLARE_EXC(LogicError, CLError.ptr()); DECLARE_EXC(RuntimeError, CLError.ptr()); py::register_exception_translator( [](const std::exception_ptr &p, void * /* unused */) { try { if (p) std::rethrow_exception(p); } catch (pyopencl::error &err) { py::object err_obj = py::cast(err); if (err.code() == CL_MEM_OBJECT_ALLOCATION_FAILURE) PyErr_SetObject(CLMemoryError.ptr(), err_obj.ptr()); else if (err.code() <= CL_INVALID_VALUE) PyErr_SetObject(CLLogicError.ptr(), err_obj.ptr()); else if (err.code() > CL_INVALID_VALUE && err.code() < CL_SUCCESS) PyErr_SetObject(CLRuntimeError.ptr(), err_obj.ptr()); else PyErr_SetObject(CLError.ptr(), err_obj.ptr()); } }); } // }}} // {{{ error record { typedef error cls; py::class_ (m, "_ErrorRecord") .def(py::init(), py::arg("routine"), py::arg("code"), py::arg("msg")) .DEF_SIMPLE_METHOD(routine) .DEF_SIMPLE_METHOD(code) .def("what", &cls::err_what) .DEF_SIMPLE_METHOD(is_out_of_memory) .def("_program", &cls::get_program) ; } // }}} // {{{ constants #define ADD_ATTR(PREFIX, NAME) \ cls.attr(#NAME) = CL_##PREFIX##NAME #define ADD_ATTR_SUFFIX(PREFIX, NAME, SUFFIX) \ cls.attr(#NAME) = CL_##PREFIX##NAME##SUFFIX { py::class_ cls(m, "status_code"); ADD_ATTR(, SUCCESS); ADD_ATTR(, DEVICE_NOT_FOUND); ADD_ATTR(, DEVICE_NOT_AVAILABLE); #if !(defined(CL_PLATFORM_NVIDIA) && CL_PLATFORM_NVIDIA == 0x3001) ADD_ATTR(, COMPILER_NOT_AVAILABLE); #endif ADD_ATTR(, MEM_OBJECT_ALLOCATION_FAILURE); ADD_ATTR(, OUT_OF_RESOURCES); ADD_ATTR(, OUT_OF_HOST_MEMORY); ADD_ATTR(, PROFILING_INFO_NOT_AVAILABLE); ADD_ATTR(, MEM_COPY_OVERLAP); ADD_ATTR(, 
IMAGE_FORMAT_MISMATCH); ADD_ATTR(, IMAGE_FORMAT_NOT_SUPPORTED); ADD_ATTR(, BUILD_PROGRAM_FAILURE); ADD_ATTR(, MAP_FAILURE); ADD_ATTR(, INVALID_VALUE); ADD_ATTR(, INVALID_DEVICE_TYPE); ADD_ATTR(, INVALID_PLATFORM); ADD_ATTR(, INVALID_DEVICE); ADD_ATTR(, INVALID_CONTEXT); ADD_ATTR(, INVALID_QUEUE_PROPERTIES); ADD_ATTR(, INVALID_COMMAND_QUEUE); ADD_ATTR(, INVALID_HOST_PTR); ADD_ATTR(, INVALID_MEM_OBJECT); ADD_ATTR(, INVALID_IMAGE_FORMAT_DESCRIPTOR); ADD_ATTR(, INVALID_IMAGE_SIZE); ADD_ATTR(, INVALID_SAMPLER); ADD_ATTR(, INVALID_BINARY); ADD_ATTR(, INVALID_BUILD_OPTIONS); ADD_ATTR(, INVALID_PROGRAM); ADD_ATTR(, INVALID_PROGRAM_EXECUTABLE); ADD_ATTR(, INVALID_KERNEL_NAME); ADD_ATTR(, INVALID_KERNEL_DEFINITION); ADD_ATTR(, INVALID_KERNEL); ADD_ATTR(, INVALID_ARG_INDEX); ADD_ATTR(, INVALID_ARG_VALUE); ADD_ATTR(, INVALID_ARG_SIZE); ADD_ATTR(, INVALID_KERNEL_ARGS); ADD_ATTR(, INVALID_WORK_DIMENSION); ADD_ATTR(, INVALID_WORK_GROUP_SIZE); ADD_ATTR(, INVALID_WORK_ITEM_SIZE); ADD_ATTR(, INVALID_GLOBAL_OFFSET); ADD_ATTR(, INVALID_EVENT_WAIT_LIST); ADD_ATTR(, INVALID_EVENT); ADD_ATTR(, INVALID_OPERATION); ADD_ATTR(, INVALID_GL_OBJECT); ADD_ATTR(, INVALID_BUFFER_SIZE); ADD_ATTR(, INVALID_MIP_LEVEL); #if defined(cl_khr_icd) && (cl_khr_icd >= 1) ADD_ATTR(, PLATFORM_NOT_FOUND_KHR); #endif #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) ADD_ATTR(, INVALID_GL_SHAREGROUP_REFERENCE_KHR); #endif #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(, MISALIGNED_SUB_BUFFER_OFFSET); ADD_ATTR(, EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST); ADD_ATTR(, INVALID_GLOBAL_WORK_SIZE); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(, COMPILE_PROGRAM_FAILURE); ADD_ATTR(, LINKER_NOT_AVAILABLE); ADD_ATTR(, LINK_PROGRAM_FAILURE); ADD_ATTR(, DEVICE_PARTITION_FAILED); ADD_ATTR(, KERNEL_ARG_INFO_NOT_AVAILABLE); ADD_ATTR(, INVALID_IMAGE_DESCRIPTOR); ADD_ATTR(, INVALID_COMPILER_OPTIONS); ADD_ATTR(, INVALID_LINKER_OPTIONS); ADD_ATTR(, INVALID_DEVICE_PARTITION_COUNT); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(, INVALID_PIPE_SIZE); ADD_ATTR(, INVALID_DEVICE_QUEUE); #endif #if PYOPENCL_CL_VERSION >= 0x2020 ADD_ATTR(, INVALID_SPEC_ID); ADD_ATTR(, MAX_SIZE_RESTRICTION_EXCEEDED); #endif #if defined(cl_ext_device_fission) && defined(PYOPENCL_USE_DEVICE_FISSION) ADD_ATTR(, DEVICE_PARTITION_FAILED_EXT); ADD_ATTR(, INVALID_PARTITION_COUNT_EXT); ADD_ATTR(, INVALID_PARTITION_NAME_EXT); #endif } { py::class_ cls(m, "platform_info"); ADD_ATTR(PLATFORM_, PROFILE); ADD_ATTR(PLATFORM_, VERSION); ADD_ATTR(PLATFORM_, NAME); ADD_ATTR(PLATFORM_, VENDOR); #if !(defined(CL_PLATFORM_NVIDIA) && CL_PLATFORM_NVIDIA == 0x3001) ADD_ATTR(PLATFORM_, EXTENSIONS); #endif #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR(PLATFORM_, HOST_TIMER_RESOLUTION); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(PLATFORM_, NUMERIC_VERSION); ADD_ATTR(PLATFORM_, EXTENSIONS_WITH_VERSION); #endif } { py::class_ cls(m, "device_type"); ADD_ATTR(DEVICE_TYPE_, DEFAULT); ADD_ATTR(DEVICE_TYPE_, CPU); ADD_ATTR(DEVICE_TYPE_, GPU); ADD_ATTR(DEVICE_TYPE_, ACCELERATOR); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(DEVICE_TYPE_, CUSTOM); #endif ADD_ATTR(DEVICE_TYPE_, ALL); } { py::class_ cls(m, "device_info"); ADD_ATTR(DEVICE_, TYPE); ADD_ATTR(DEVICE_, VENDOR_ID); ADD_ATTR(DEVICE_, MAX_COMPUTE_UNITS); ADD_ATTR(DEVICE_, MAX_WORK_ITEM_DIMENSIONS); ADD_ATTR(DEVICE_, MAX_WORK_GROUP_SIZE); ADD_ATTR(DEVICE_, MAX_WORK_ITEM_SIZES); ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_CHAR); ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_SHORT); ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_INT); ADD_ATTR(DEVICE_, 
PREFERRED_VECTOR_WIDTH_LONG); ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_FLOAT); ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_DOUBLE); ADD_ATTR(DEVICE_, MAX_CLOCK_FREQUENCY); ADD_ATTR(DEVICE_, ADDRESS_BITS); ADD_ATTR(DEVICE_, MAX_READ_IMAGE_ARGS); ADD_ATTR(DEVICE_, MAX_WRITE_IMAGE_ARGS); ADD_ATTR(DEVICE_, MAX_MEM_ALLOC_SIZE); ADD_ATTR(DEVICE_, IMAGE2D_MAX_WIDTH); ADD_ATTR(DEVICE_, IMAGE2D_MAX_HEIGHT); ADD_ATTR(DEVICE_, IMAGE3D_MAX_WIDTH); ADD_ATTR(DEVICE_, IMAGE3D_MAX_HEIGHT); ADD_ATTR(DEVICE_, IMAGE3D_MAX_DEPTH); ADD_ATTR(DEVICE_, IMAGE_SUPPORT); ADD_ATTR(DEVICE_, MAX_PARAMETER_SIZE); ADD_ATTR(DEVICE_, MAX_SAMPLERS); ADD_ATTR(DEVICE_, MEM_BASE_ADDR_ALIGN); ADD_ATTR(DEVICE_, MIN_DATA_TYPE_ALIGN_SIZE); ADD_ATTR(DEVICE_, SINGLE_FP_CONFIG); #ifdef CL_DEVICE_DOUBLE_FP_CONFIG ADD_ATTR(DEVICE_, DOUBLE_FP_CONFIG); #endif #ifdef CL_DEVICE_HALF_FP_CONFIG ADD_ATTR(DEVICE_, HALF_FP_CONFIG); #endif ADD_ATTR(DEVICE_, GLOBAL_MEM_CACHE_TYPE); ADD_ATTR(DEVICE_, GLOBAL_MEM_CACHELINE_SIZE); ADD_ATTR(DEVICE_, GLOBAL_MEM_CACHE_SIZE); ADD_ATTR(DEVICE_, GLOBAL_MEM_SIZE); ADD_ATTR(DEVICE_, MAX_CONSTANT_BUFFER_SIZE); ADD_ATTR(DEVICE_, MAX_CONSTANT_ARGS); ADD_ATTR(DEVICE_, LOCAL_MEM_TYPE); ADD_ATTR(DEVICE_, LOCAL_MEM_SIZE); ADD_ATTR(DEVICE_, ERROR_CORRECTION_SUPPORT); ADD_ATTR(DEVICE_, PROFILING_TIMER_RESOLUTION); ADD_ATTR(DEVICE_, ENDIAN_LITTLE); ADD_ATTR(DEVICE_, AVAILABLE); ADD_ATTR(DEVICE_, COMPILER_AVAILABLE); ADD_ATTR(DEVICE_, EXECUTION_CAPABILITIES); ADD_ATTR(DEVICE_, QUEUE_PROPERTIES); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(DEVICE_, QUEUE_ON_HOST_PROPERTIES); #endif ADD_ATTR(DEVICE_, NAME); ADD_ATTR(DEVICE_, VENDOR); ADD_ATTR(, DRIVER_VERSION); ADD_ATTR(DEVICE_, PROFILE); ADD_ATTR(DEVICE_, VERSION); ADD_ATTR(DEVICE_, EXTENSIONS); ADD_ATTR(DEVICE_, PLATFORM); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(DEVICE_, PREFERRED_VECTOR_WIDTH_HALF); ADD_ATTR(DEVICE_, HOST_UNIFIED_MEMORY); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_CHAR); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_SHORT); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_INT); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_LONG); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_FLOAT); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_DOUBLE); ADD_ATTR(DEVICE_, NATIVE_VECTOR_WIDTH_HALF); ADD_ATTR(DEVICE_, OPENCL_C_VERSION); #endif // support for cl_nv_device_attribute_query #ifdef CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV ADD_ATTR(DEVICE_, COMPUTE_CAPABILITY_MAJOR_NV); ADD_ATTR(DEVICE_, COMPUTE_CAPABILITY_MINOR_NV); ADD_ATTR(DEVICE_, REGISTERS_PER_BLOCK_NV); ADD_ATTR(DEVICE_, WARP_SIZE_NV); ADD_ATTR(DEVICE_, GPU_OVERLAP_NV); ADD_ATTR(DEVICE_, KERNEL_EXEC_TIMEOUT_NV); ADD_ATTR(DEVICE_, INTEGRATED_MEMORY_NV); // Nvidia specific device attributes, not defined in Khronos CL/cl_ext.h #ifdef CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV ADD_ATTR(DEVICE_, ATTRIBUTE_ASYNC_ENGINE_COUNT_NV); #endif #ifdef CL_DEVICE_PCI_BUS_ID_NV ADD_ATTR(DEVICE_, PCI_BUS_ID_NV); #endif #ifdef CL_DEVICE_PCI_SLOT_ID_NV ADD_ATTR(DEVICE_, PCI_SLOT_ID_NV); #endif #ifdef CL_DEVICE_PCI_DOMAIN_ID_NV ADD_ATTR(DEVICE_, PCI_DOMAIN_ID_NV); #endif #endif // {{{ cl_amd_device_attribute_query #ifdef CL_DEVICE_PROFILING_TIMER_OFFSET_AMD ADD_ATTR(DEVICE_, PROFILING_TIMER_OFFSET_AMD); #endif #ifdef CL_DEVICE_TOPOLOGY_AMD ADD_ATTR(DEVICE_, TOPOLOGY_AMD); #endif #ifdef CL_DEVICE_BOARD_NAME_AMD ADD_ATTR(DEVICE_, BOARD_NAME_AMD); #endif #ifdef CL_DEVICE_GLOBAL_FREE_MEMORY_AMD ADD_ATTR(DEVICE_, GLOBAL_FREE_MEMORY_AMD); #endif #ifdef CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD ADD_ATTR(DEVICE_, SIMD_PER_COMPUTE_UNIT_AMD); #endif #ifdef
CL_DEVICE_SIMD_WIDTH_AMD ADD_ATTR(DEVICE_, SIMD_WIDTH_AMD); #endif #ifdef CL_DEVICE_SIMD_INSTRUCTION_WIDTH_AMD ADD_ATTR(DEVICE_, SIMD_INSTRUCTION_WIDTH_AMD); #endif #ifdef CL_DEVICE_WAVEFRONT_WIDTH_AMD ADD_ATTR(DEVICE_, WAVEFRONT_WIDTH_AMD); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD ADD_ATTR(DEVICE_, GLOBAL_MEM_CHANNELS_AMD); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD ADD_ATTR(DEVICE_, GLOBAL_MEM_CHANNEL_BANKS_AMD); #endif #ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD ADD_ATTR(DEVICE_, GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD); #endif #ifdef CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD ADD_ATTR(DEVICE_, LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD); #endif #ifdef CL_DEVICE_LOCAL_MEM_BANKS_AMD ADD_ATTR(DEVICE_, LOCAL_MEM_BANKS_AMD); #endif #ifdef CL_DEVICE_THREAD_TRACE_SUPPORTED_AMD ADD_ATTR(DEVICE_, THREAD_TRACE_SUPPORTED_AMD); #endif #ifdef CL_DEVICE_GFXIP_MAJOR_AMD ADD_ATTR(DEVICE_, GFXIP_MAJOR_AMD); #endif #ifdef CL_DEVICE_GFXIP_MINOR_AMD ADD_ATTR(DEVICE_, GFXIP_MINOR_AMD); #endif #ifdef CL_DEVICE_AVAILABLE_ASYNC_QUEUES_AMD ADD_ATTR(DEVICE_, AVAILABLE_ASYNC_QUEUES_AMD); #endif #ifdef CL_DEVICE_PREFERRED_WORK_GROUP_SIZE_AMD ADD_ATTR(DEVICE_, PREFERRED_WORK_GROUP_SIZE_AMD); #endif #ifdef CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD ADD_ATTR(DEVICE_, MAX_WORK_GROUP_SIZE_AMD); #endif #ifdef CL_DEVICE_PREFERRED_CONSTANT_BUFFER_SIZE_AMD ADD_ATTR(DEVICE_, PREFERRED_CONSTANT_BUFFER_SIZE_AMD); #endif #ifdef CL_DEVICE_PCIE_ID_AMD ADD_ATTR(DEVICE_, PCIE_ID_AMD); #endif // }}} #ifdef CL_DEVICE_MAX_ATOMIC_COUNTERS_EXT ADD_ATTR(DEVICE_, MAX_ATOMIC_COUNTERS_EXT); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(DEVICE_, LINKER_AVAILABLE); ADD_ATTR(DEVICE_, BUILT_IN_KERNELS); ADD_ATTR(DEVICE_, IMAGE_MAX_BUFFER_SIZE); ADD_ATTR(DEVICE_, IMAGE_MAX_ARRAY_SIZE); ADD_ATTR(DEVICE_, PARENT_DEVICE); ADD_ATTR(DEVICE_, PARTITION_MAX_SUB_DEVICES); ADD_ATTR(DEVICE_, PARTITION_PROPERTIES); ADD_ATTR(DEVICE_, PARTITION_AFFINITY_DOMAIN); ADD_ATTR(DEVICE_, PARTITION_TYPE); ADD_ATTR(DEVICE_, REFERENCE_COUNT); ADD_ATTR(DEVICE_, PREFERRED_INTEROP_USER_SYNC); ADD_ATTR(DEVICE_, PRINTF_BUFFER_SIZE); #endif #ifdef cl_khr_image2d_from_buffer ADD_ATTR(DEVICE_, IMAGE_PITCH_ALIGNMENT); ADD_ATTR(DEVICE_, IMAGE_BASE_ADDRESS_ALIGNMENT); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(DEVICE_, MAX_READ_WRITE_IMAGE_ARGS); ADD_ATTR(DEVICE_, MAX_GLOBAL_VARIABLE_SIZE); ADD_ATTR(DEVICE_, QUEUE_ON_DEVICE_PROPERTIES); ADD_ATTR(DEVICE_, QUEUE_ON_DEVICE_PREFERRED_SIZE); ADD_ATTR(DEVICE_, QUEUE_ON_DEVICE_MAX_SIZE); ADD_ATTR(DEVICE_, MAX_ON_DEVICE_QUEUES); ADD_ATTR(DEVICE_, MAX_ON_DEVICE_EVENTS); ADD_ATTR(DEVICE_, SVM_CAPABILITIES); ADD_ATTR(DEVICE_, GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE); ADD_ATTR(DEVICE_, MAX_PIPE_ARGS); ADD_ATTR(DEVICE_, PIPE_MAX_ACTIVE_RESERVATIONS); ADD_ATTR(DEVICE_, PIPE_MAX_PACKET_SIZE); ADD_ATTR(DEVICE_, PREFERRED_PLATFORM_ATOMIC_ALIGNMENT); ADD_ATTR(DEVICE_, PREFERRED_GLOBAL_ATOMIC_ALIGNMENT); ADD_ATTR(DEVICE_, PREFERRED_LOCAL_ATOMIC_ALIGNMENT); #endif #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR(DEVICE_, IL_VERSION); ADD_ATTR(DEVICE_, MAX_NUM_SUB_GROUPS); ADD_ATTR(DEVICE_, SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(DEVICE_, NUMERIC_VERSION); ADD_ATTR(DEVICE_, EXTENSIONS_WITH_VERSION); ADD_ATTR(DEVICE_, ILS_WITH_VERSION); ADD_ATTR(DEVICE_, BUILT_IN_KERNELS_WITH_VERSION); ADD_ATTR(DEVICE_, ATOMIC_MEMORY_CAPABILITIES); ADD_ATTR(DEVICE_, ATOMIC_FENCE_CAPABILITIES); ADD_ATTR(DEVICE_, NON_UNIFORM_WORK_GROUP_SUPPORT); ADD_ATTR(DEVICE_, OPENCL_C_ALL_VERSIONS); ADD_ATTR(DEVICE_, 
PREFERRED_WORK_GROUP_SIZE_MULTIPLE); ADD_ATTR(DEVICE_, WORK_GROUP_COLLECTIVE_FUNCTIONS_SUPPORT); ADD_ATTR(DEVICE_, GENERIC_ADDRESS_SPACE_SUPPORT); ADD_ATTR(DEVICE_, OPENCL_C_FEATURES); #ifdef CL_DEVICE_DEVICE_ENQUEUE_SUPPORT // some busted headers shipped by Debian have this cls.attr("DEVICE_ENQUEUE_CAPABILITIES") = CL_DEVICE_DEVICE_ENQUEUE_SUPPORT; #else ADD_ATTR(DEVICE_, DEVICE_ENQUEUE_CAPABILITIES); #endif ADD_ATTR(DEVICE_, PIPE_SUPPORT); #endif /* cl_intel_advanced_motion_estimation */ #ifdef CL_DEVICE_ME_VERSION_INTEL ADD_ATTR(DEVICE_, ME_VERSION_INTEL); #endif /* cl_qcom_ext_host_ptr */ #ifdef CL_DEVICE_EXT_MEM_PADDING_IN_BYTES_QCOM ADD_ATTR(DEVICE_, EXT_MEM_PADDING_IN_BYTES_QCOM); #endif #ifdef CL_DEVICE_PAGE_SIZE_QCOM ADD_ATTR(DEVICE_, PAGE_SIZE_QCOM); #endif /* cl_khr_spir */ #ifdef CL_DEVICE_SPIR_VERSIONS ADD_ATTR(DEVICE_, SPIR_VERSIONS); #endif /* cl_altera_device_temperature */ #ifdef CL_DEVICE_CORE_TEMPERATURE_ALTERA ADD_ATTR(DEVICE_, CORE_TEMPERATURE_ALTERA); #endif /* cl_intel_simultaneous_sharing */ #ifdef CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL ADD_ATTR(DEVICE_, SIMULTANEOUS_INTEROPS_INTEL); #endif #ifdef CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL ADD_ATTR(DEVICE_, NUM_SIMULTANEOUS_INTEROPS_INTEL); #endif } { py::class_ cls(m, "device_topology_type_amd"); #ifdef CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD cls.attr("PCIE") = CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD; #endif } { py::class_ cls(m, "device_fp_config"); ADD_ATTR(FP_, DENORM); ADD_ATTR(FP_, INF_NAN); ADD_ATTR(FP_, ROUND_TO_NEAREST); ADD_ATTR(FP_, ROUND_TO_ZERO); ADD_ATTR(FP_, ROUND_TO_INF); ADD_ATTR(FP_, FMA); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(FP_, SOFT_FLOAT); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(FP_, CORRECTLY_ROUNDED_DIVIDE_SQRT); #endif } { py::class_ cls(m, "device_mem_cache_type"); ADD_ATTR( , NONE); ADD_ATTR( , READ_ONLY_CACHE); ADD_ATTR( , READ_WRITE_CACHE); } { py::class_ cls(m, "device_local_mem_type"); ADD_ATTR( , LOCAL); ADD_ATTR( , GLOBAL); } { py::class_ cls(m, "device_exec_capabilities"); ADD_ATTR(EXEC_, KERNEL); ADD_ATTR(EXEC_, NATIVE_KERNEL); #ifdef CL_EXEC_IMMEDIATE_EXECUTION_INTEL ADD_ATTR(EXEC_, IMMEDIATE_EXECUTION_INTEL); #endif } { py::class_ cls(m, "device_svm_capabilities"); #if PYOPENCL_CL_VERSION >= 0x2000 // device_svm_capabilities ADD_ATTR(DEVICE_SVM_, COARSE_GRAIN_BUFFER); ADD_ATTR(DEVICE_SVM_, FINE_GRAIN_BUFFER); ADD_ATTR(DEVICE_SVM_, FINE_GRAIN_SYSTEM); ADD_ATTR(DEVICE_SVM_, ATOMICS); #endif } { py::class_ cls(m, "command_queue_properties"); ADD_ATTR(QUEUE_, OUT_OF_ORDER_EXEC_MODE_ENABLE); ADD_ATTR(QUEUE_, PROFILING_ENABLE); #ifdef CL_QUEUE_IMMEDIATE_EXECUTION_ENABLE_INTEL ADD_ATTR(QUEUE_, IMMEDIATE_EXECUTION_ENABLE_INTEL); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(QUEUE_, ON_DEVICE); ADD_ATTR(QUEUE_, ON_DEVICE_DEFAULT); #endif } { py::class_ cls(m, "context_info"); ADD_ATTR(CONTEXT_, REFERENCE_COUNT); ADD_ATTR(CONTEXT_, DEVICES); ADD_ATTR(CONTEXT_, PROPERTIES); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(CONTEXT_, NUM_DEVICES); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(CONTEXT_, INTEROP_USER_SYNC); #endif } { py::class_ cls(m, "gl_context_info"); #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) ADD_ATTR(, CURRENT_DEVICE_FOR_GL_CONTEXT_KHR); ADD_ATTR(, DEVICES_FOR_GL_CONTEXT_KHR); #endif } { py::class_ cls(m, "context_properties"); ADD_ATTR(CONTEXT_, PLATFORM); #if defined(cl_khr_gl_sharing) && (cl_khr_gl_sharing >= 1) ADD_ATTR( ,GL_CONTEXT_KHR); ADD_ATTR( ,EGL_DISPLAY_KHR); ADD_ATTR( ,GLX_DISPLAY_KHR); ADD_ATTR( ,WGL_HDC_KHR); ADD_ATTR( 
,CGL_SHAREGROUP_KHR); #endif #if defined(__APPLE__) && defined(HAVE_GL) ADD_ATTR( ,CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE); #endif /* __APPLE__ */ // cl_amd_offline_devices #ifdef CL_CONTEXT_OFFLINE_DEVICES_AMD ADD_ATTR(CONTEXT_, OFFLINE_DEVICES_AMD); #endif } { py::class_ cls(m, "command_queue_info"); ADD_ATTR(QUEUE_, CONTEXT); ADD_ATTR(QUEUE_, DEVICE); ADD_ATTR(QUEUE_, REFERENCE_COUNT); ADD_ATTR(QUEUE_, PROPERTIES); #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(QUEUE_, PROPERTIES_ARRAY); #endif } { py::class_ cls(m, "queue_properties"); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(QUEUE_, PROPERTIES); ADD_ATTR(QUEUE_, SIZE); #endif #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR(QUEUE_, DEVICE_DEFAULT); #endif } { py::class_ cls(m, "mem_flags"); ADD_ATTR(MEM_, READ_WRITE); ADD_ATTR(MEM_, WRITE_ONLY); ADD_ATTR(MEM_, READ_ONLY); ADD_ATTR(MEM_, USE_HOST_PTR); ADD_ATTR(MEM_, ALLOC_HOST_PTR); ADD_ATTR(MEM_, COPY_HOST_PTR); #ifdef cl_amd_device_memory_flags ADD_ATTR(MEM_, USE_PERSISTENT_MEM_AMD); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(MEM_, HOST_WRITE_ONLY); ADD_ATTR(MEM_, HOST_READ_ONLY); ADD_ATTR(MEM_, HOST_NO_ACCESS); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(MEM_, KERNEL_READ_AND_WRITE); #endif } { py::class_ cls(m, "svm_mem_flags"); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(MEM_, READ_WRITE); ADD_ATTR(MEM_, WRITE_ONLY); ADD_ATTR(MEM_, READ_ONLY); ADD_ATTR(MEM_, SVM_FINE_GRAIN_BUFFER); ADD_ATTR(MEM_, SVM_ATOMICS); #endif } { py::class_ cls(m, "channel_order"); ADD_ATTR( , R); ADD_ATTR( , A); ADD_ATTR( , RG); ADD_ATTR( , RA); ADD_ATTR( , RGB); ADD_ATTR( , RGBA); ADD_ATTR( , BGRA); ADD_ATTR( , INTENSITY); ADD_ATTR( , LUMINANCE); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR( , Rx); ADD_ATTR( , RGx); ADD_ATTR( , RGBx); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR( , sRGB); ADD_ATTR( , sRGBx); ADD_ATTR( , sRGBA); ADD_ATTR( , sBGRA); ADD_ATTR( , ABGR); #endif } { py::class_ cls(m, "channel_type"); ADD_ATTR( , SNORM_INT8); ADD_ATTR( , SNORM_INT16); ADD_ATTR( , UNORM_INT8); ADD_ATTR( , UNORM_INT16); ADD_ATTR( , UNORM_SHORT_565); ADD_ATTR( , UNORM_SHORT_555); ADD_ATTR( , UNORM_INT_101010); ADD_ATTR( , SIGNED_INT8); ADD_ATTR( , SIGNED_INT16); ADD_ATTR( , SIGNED_INT32); ADD_ATTR( , UNSIGNED_INT8); ADD_ATTR( , UNSIGNED_INT16); ADD_ATTR( , UNSIGNED_INT32); ADD_ATTR( , HALF_FLOAT); ADD_ATTR( , FLOAT); #if PYOPENCL_CL_VERSION >= 0x1020 && defined(cl_khr_gl_sharing) ADD_ATTR( , UNORM_INT24); #endif #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR( , UNORM_INT_101010_2); #endif } { py::class_ cls(m, "mem_object_type"); ADD_ATTR(MEM_OBJECT_, BUFFER); ADD_ATTR(MEM_OBJECT_, IMAGE2D); ADD_ATTR(MEM_OBJECT_, IMAGE3D); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(MEM_OBJECT_, IMAGE2D_ARRAY); ADD_ATTR(MEM_OBJECT_, IMAGE1D); ADD_ATTR(MEM_OBJECT_, IMAGE1D_ARRAY); ADD_ATTR(MEM_OBJECT_, IMAGE1D_BUFFER); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(MEM_OBJECT_, PIPE); #endif } { py::class_ cls(m, "mem_info"); ADD_ATTR(MEM_, TYPE); ADD_ATTR(MEM_, FLAGS); ADD_ATTR(MEM_, SIZE); ADD_ATTR(MEM_, HOST_PTR); ADD_ATTR(MEM_, MAP_COUNT); ADD_ATTR(MEM_, REFERENCE_COUNT); ADD_ATTR(MEM_, CONTEXT); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(MEM_, ASSOCIATED_MEMOBJECT); ADD_ATTR(MEM_, OFFSET); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(MEM_, USES_SVM_POINTER); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(MEM_, PROPERTIES); #endif } { py::class_ cls(m, "image_info"); ADD_ATTR(IMAGE_, FORMAT); ADD_ATTR(IMAGE_, ELEMENT_SIZE); ADD_ATTR(IMAGE_, ROW_PITCH); ADD_ATTR(IMAGE_, SLICE_PITCH); ADD_ATTR(IMAGE_, WIDTH); 
ADD_ATTR(IMAGE_, HEIGHT); ADD_ATTR(IMAGE_, DEPTH); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(IMAGE_, ARRAY_SIZE); ADD_ATTR(IMAGE_, BUFFER); ADD_ATTR(IMAGE_, NUM_MIP_LEVELS); ADD_ATTR(IMAGE_, NUM_SAMPLES); #endif } { py::class_ cls(m, "pipe_info"); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(PIPE_, PACKET_SIZE); ADD_ATTR(PIPE_, MAX_PACKETS); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(PIPE_, PROPERTIES); #endif } { py::class_ cls(m, "pipe_properties"); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(PIPE_, PACKET_SIZE); ADD_ATTR(PIPE_, MAX_PACKETS); #endif } { py::class_ cls(m, "addressing_mode"); ADD_ATTR(ADDRESS_, NONE); ADD_ATTR(ADDRESS_, CLAMP_TO_EDGE); ADD_ATTR(ADDRESS_, CLAMP); ADD_ATTR(ADDRESS_, REPEAT); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(ADDRESS_, MIRRORED_REPEAT); #endif } { py::class_ cls(m, "filter_mode"); ADD_ATTR(FILTER_, NEAREST); ADD_ATTR(FILTER_, LINEAR); } { py::class_ cls(m, "sampler_info"); ADD_ATTR(SAMPLER_, REFERENCE_COUNT); ADD_ATTR(SAMPLER_, CONTEXT); ADD_ATTR(SAMPLER_, NORMALIZED_COORDS); ADD_ATTR(SAMPLER_, ADDRESSING_MODE); ADD_ATTR(SAMPLER_, FILTER_MODE); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(SAMPLER_, MIP_FILTER_MODE); ADD_ATTR(SAMPLER_, LOD_MIN); ADD_ATTR(SAMPLER_, LOD_MAX); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(SAMPLER_, PROPERTIES); #endif // {{{ cl_khr_mipmap_image #ifdef CL_SAMPLER_MIP_FILTER_MODE_KHR ADD_ATTR(SAMPLER_, MIP_FILTER_MODE_KHR); ADD_ATTR(SAMPLER_, LOD_MIN_KHR); ADD_ATTR(SAMPLER_, LOD_MAX_KHR); #endif // }}} } { py::class_ cls(m, "sampler_properties"); ADD_ATTR(SAMPLER_, NORMALIZED_COORDS); ADD_ATTR(SAMPLER_, ADDRESSING_MODE); ADD_ATTR(SAMPLER_, FILTER_MODE); } { py::class_ cls(m, "map_flags"); ADD_ATTR(MAP_, READ); ADD_ATTR(MAP_, WRITE); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(MAP_, WRITE_INVALIDATE_REGION); #endif } { py::class_ cls(m, "program_info"); ADD_ATTR(PROGRAM_, REFERENCE_COUNT); ADD_ATTR(PROGRAM_, CONTEXT); ADD_ATTR(PROGRAM_, NUM_DEVICES); ADD_ATTR(PROGRAM_, DEVICES); ADD_ATTR(PROGRAM_, SOURCE); ADD_ATTR(PROGRAM_, BINARY_SIZES); ADD_ATTR(PROGRAM_, BINARIES); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(PROGRAM_, NUM_KERNELS); ADD_ATTR(PROGRAM_, KERNEL_NAMES); #endif #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR(PROGRAM_, IL); #endif #if PYOPENCL_CL_VERSION >= 0x2020 ADD_ATTR(PROGRAM_, SCOPE_GLOBAL_CTORS_PRESENT); ADD_ATTR(PROGRAM_, SCOPE_GLOBAL_DTORS_PRESENT); #endif } { py::class_ cls(m, "program_build_info"); ADD_ATTR(PROGRAM_BUILD_, STATUS); ADD_ATTR(PROGRAM_BUILD_, OPTIONS); ADD_ATTR(PROGRAM_BUILD_, LOG); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(PROGRAM_, BINARY_TYPE); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(PROGRAM_BUILD_, GLOBAL_VARIABLE_TOTAL_SIZE); #endif } { py::class_ cls(m, "program_binary_type"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(PROGRAM_BINARY_TYPE_, NONE); ADD_ATTR(PROGRAM_BINARY_TYPE_, COMPILED_OBJECT); ADD_ATTR(PROGRAM_BINARY_TYPE_, LIBRARY); ADD_ATTR(PROGRAM_BINARY_TYPE_, EXECUTABLE); #endif } { py::class_ cls(m, "kernel_info"); ADD_ATTR(KERNEL_, FUNCTION_NAME); ADD_ATTR(KERNEL_, NUM_ARGS); ADD_ATTR(KERNEL_, REFERENCE_COUNT); ADD_ATTR(KERNEL_, CONTEXT); ADD_ATTR(KERNEL_, PROGRAM); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_, ATTRIBUTES); #endif } { py::class_ cls(m, "kernel_arg_info"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_ARG_, ADDRESS_QUALIFIER); ADD_ATTR(KERNEL_ARG_, ACCESS_QUALIFIER); ADD_ATTR(KERNEL_ARG_, TYPE_NAME); ADD_ATTR(KERNEL_ARG_, TYPE_QUALIFIER); ADD_ATTR(KERNEL_ARG_, NAME); #endif } { py::class_ cls( m, "kernel_arg_address_qualifier"); #if 
PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_ARG_ADDRESS_, GLOBAL); ADD_ATTR(KERNEL_ARG_ADDRESS_, LOCAL); ADD_ATTR(KERNEL_ARG_ADDRESS_, CONSTANT); ADD_ATTR(KERNEL_ARG_ADDRESS_, PRIVATE); #endif } { py::class_ cls( m, "kernel_arg_access_qualifier"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_ARG_ACCESS_, READ_ONLY); ADD_ATTR(KERNEL_ARG_ACCESS_, WRITE_ONLY); ADD_ATTR(KERNEL_ARG_ACCESS_, READ_WRITE); ADD_ATTR(KERNEL_ARG_ACCESS_, NONE); #endif } { py::class_ cls( m, "kernel_arg_type_qualifier"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_ARG_TYPE_, NONE); ADD_ATTR(KERNEL_ARG_TYPE_, CONST); ADD_ATTR(KERNEL_ARG_TYPE_, RESTRICT); ADD_ATTR(KERNEL_ARG_TYPE_, VOLATILE); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(KERNEL_ARG_TYPE_, PIPE); #endif } { py::class_ cls(m, "kernel_work_group_info"); ADD_ATTR(KERNEL_, WORK_GROUP_SIZE); ADD_ATTR(KERNEL_, COMPILE_WORK_GROUP_SIZE); ADD_ATTR(KERNEL_, LOCAL_MEM_SIZE); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(KERNEL_, PREFERRED_WORK_GROUP_SIZE_MULTIPLE); ADD_ATTR(KERNEL_, PRIVATE_MEM_SIZE); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(KERNEL_, GLOBAL_WORK_SIZE); #endif } { py::class_ cls(m, "kernel_sub_group_info"); #if PYOPENCL_CL_VERSION >= 0x2010 ADD_ATTR(KERNEL_, MAX_SUB_GROUP_SIZE_FOR_NDRANGE); ADD_ATTR(KERNEL_, SUB_GROUP_COUNT_FOR_NDRANGE); ADD_ATTR(KERNEL_, LOCAL_SIZE_FOR_SUB_GROUP_COUNT); ADD_ATTR(KERNEL_, MAX_NUM_SUB_GROUPS); ADD_ATTR(KERNEL_, COMPILE_NUM_SUB_GROUPS); #endif } { py::class_ cls(m, "event_info"); ADD_ATTR(EVENT_, COMMAND_QUEUE); ADD_ATTR(EVENT_, COMMAND_TYPE); ADD_ATTR(EVENT_, REFERENCE_COUNT); ADD_ATTR(EVENT_, COMMAND_EXECUTION_STATUS); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(EVENT_, CONTEXT); #endif } { py::class_ cls(m, "command_type"); ADD_ATTR(COMMAND_, NDRANGE_KERNEL); ADD_ATTR(COMMAND_, TASK); ADD_ATTR(COMMAND_, NATIVE_KERNEL); ADD_ATTR(COMMAND_, READ_BUFFER); ADD_ATTR(COMMAND_, WRITE_BUFFER); ADD_ATTR(COMMAND_, COPY_BUFFER); ADD_ATTR(COMMAND_, READ_IMAGE); ADD_ATTR(COMMAND_, WRITE_IMAGE); ADD_ATTR(COMMAND_, COPY_IMAGE); ADD_ATTR(COMMAND_, COPY_IMAGE_TO_BUFFER); ADD_ATTR(COMMAND_, COPY_BUFFER_TO_IMAGE); ADD_ATTR(COMMAND_, MAP_BUFFER); ADD_ATTR(COMMAND_, MAP_IMAGE); ADD_ATTR(COMMAND_, UNMAP_MEM_OBJECT); ADD_ATTR(COMMAND_, MARKER); ADD_ATTR(COMMAND_, ACQUIRE_GL_OBJECTS); ADD_ATTR(COMMAND_, RELEASE_GL_OBJECTS); #if PYOPENCL_CL_VERSION >= 0x1010 ADD_ATTR(COMMAND_, READ_BUFFER_RECT); ADD_ATTR(COMMAND_, WRITE_BUFFER_RECT); ADD_ATTR(COMMAND_, COPY_BUFFER_RECT); ADD_ATTR(COMMAND_, USER); #endif #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(COMMAND_, BARRIER); ADD_ATTR(COMMAND_, MIGRATE_MEM_OBJECTS); ADD_ATTR(COMMAND_, FILL_BUFFER); ADD_ATTR(COMMAND_, FILL_IMAGE); #endif #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(COMMAND_, SVM_FREE); ADD_ATTR(COMMAND_, SVM_MEMCPY); ADD_ATTR(COMMAND_, SVM_MEMFILL); ADD_ATTR(COMMAND_, SVM_MAP); ADD_ATTR(COMMAND_, SVM_UNMAP); #endif #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(COMMAND_, SVM_MIGRATE_MEM); #endif } { py::class_ cls(m, "command_execution_status"); ADD_ATTR(, COMPLETE); ADD_ATTR(, RUNNING); ADD_ATTR(, SUBMITTED); ADD_ATTR(, QUEUED); } { py::class_ cls(m, "profiling_info"); ADD_ATTR(PROFILING_COMMAND_, QUEUED); ADD_ATTR(PROFILING_COMMAND_, SUBMIT); ADD_ATTR(PROFILING_COMMAND_, START); ADD_ATTR(PROFILING_COMMAND_, END); #if PYOPENCL_CL_VERSION >= 0x2000 ADD_ATTR(PROFILING_COMMAND_, COMPLETE); #endif } /* not needed--filled in automatically by implementation. 
#if PYOPENCL_CL_VERSION >= 0x1010 { py::class_ cls(m, "buffer_create_type"); ADD_ATTR(BUFFER_CREATE_TYPE_, REGION); } #endif */ { py::class_ cls( m, "mem_migration_flags"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(MIGRATE_MEM_OBJECT_, HOST); ADD_ATTR(MIGRATE_MEM_OBJECT_, CONTENT_UNDEFINED); #endif } { py::class_ cls( m, "device_partition_property"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(DEVICE_PARTITION_, EQUALLY); ADD_ATTR(DEVICE_PARTITION_, BY_COUNTS); ADD_ATTR(DEVICE_PARTITION_, BY_COUNTS_LIST_END); ADD_ATTR(DEVICE_PARTITION_, BY_AFFINITY_DOMAIN); #endif } { py::class_ cls(m, "device_affinity_domain"); #if PYOPENCL_CL_VERSION >= 0x1020 ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, NUMA); ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, L4_CACHE); ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, L3_CACHE); ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, L2_CACHE); ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, L1_CACHE); ADD_ATTR(DEVICE_AFFINITY_DOMAIN_, NEXT_PARTITIONABLE); #endif } { py::class_ cls(m, "device_atomic_capabilities"); #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(DEVICE_ATOMIC_, ORDER_RELAXED); ADD_ATTR(DEVICE_ATOMIC_, ORDER_ACQ_REL); ADD_ATTR(DEVICE_ATOMIC_, ORDER_SEQ_CST); ADD_ATTR(DEVICE_ATOMIC_, SCOPE_WORK_ITEM); ADD_ATTR(DEVICE_ATOMIC_, SCOPE_WORK_GROUP); ADD_ATTR(DEVICE_ATOMIC_, SCOPE_DEVICE); ADD_ATTR(DEVICE_ATOMIC_, SCOPE_ALL_DEVICES); #endif } { py::class_ cls(m, "device_device_enqueue_capabilities"); #if (PYOPENCL_CL_VERSION >= 0x3000) && defined(CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES) ADD_ATTR(DEVICE_QUEUE_, SUPPORTED); ADD_ATTR(DEVICE_QUEUE_, REPLACEABLE_DEFAULT); #endif } { py::class_ cls(m, "version_bits"); #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(VERSION_, MAJOR_BITS); ADD_ATTR(VERSION_, MINOR_BITS); ADD_ATTR(VERSION_, PATCH_BITS); ADD_ATTR(VERSION_, MAJOR_MASK); ADD_ATTR(VERSION_, MINOR_MASK); ADD_ATTR(VERSION_, PATCH_MASK); #endif } { py::class_ cls(m, "khronos_vendor_id"); #if PYOPENCL_CL_VERSION >= 0x3000 ADD_ATTR(KHRONOS_VENDOR_ID_, CODEPLAY); #endif } #ifdef HAVE_GL { py::class_ cls(m, "gl_object_type"); ADD_ATTR(GL_OBJECT_, BUFFER); ADD_ATTR(GL_OBJECT_, TEXTURE2D); ADD_ATTR(GL_OBJECT_, TEXTURE3D); ADD_ATTR(GL_OBJECT_, RENDERBUFFER); } { py::class_ cls(m, "gl_texture_info"); ADD_ATTR(GL_, TEXTURE_TARGET); ADD_ATTR(GL_, MIPMAP_LEVEL); } #endif // }}} // {{{ cl_name_version #if PYOPENCL_CL_VERSION >= 0x3000 { typedef cl_name_version cls; py::class_(m, "NameVersion") .def("__init__", [](cls *self, cl_version version, const std::string &name) { self->version = version; self->name[0] = '\0'; // https://stackoverflow.com/a/1258577 strncat(self->name, name.c_str(), CL_NAME_VERSION_MAX_NAME_SIZE-1); }, py::arg("version")=0, py::arg("name")=0 ) .def_prop_rw("version", [](cls &t) { return t.version; }, [](cls &t, cl_version val) { t.version = val; }) .def_prop_rw("name", [](cls &t) { return t.name; }, [](cls &t, const std::string &name) { t.name[0] = '\0'; // https://stackoverflow.com/a/1258577 strncat(t.name, name.c_str(), CL_NAME_VERSION_MAX_NAME_SIZE-1); }) ; } #endif // }}} // {{{ CL_DEVICE_TOPOLOGY_AMD #ifdef CL_DEVICE_TOPOLOGY_AMD { typedef cl_device_topology_amd cls; py::class_(m, "DeviceTopologyAmd") .def("__init__", // FIXME: Nanobind thinks of 'char' as "short string", not small integer. // The detour via cl_int may lose data on assignment. 
// [](cl_char bus, cl_char device, cl_char function) [](cls *self, cl_int bus, cl_int device, cl_int function) { self->pcie.type = CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD; self->pcie.bus = (cl_char) bus; self->pcie.device = (cl_char) device; self->pcie.function = (cl_char) function; }, py::arg("bus")=0, py::arg("device")=0, py::arg("function")=0) .def_prop_rw("type", [](cls &t) { return t.pcie.type; }, [](cls &t, cl_uint val) { t.pcie.type = val; }) .def_prop_rw("bus", [](cls &t) { return t.pcie.bus; }, // FIXME: Revert to cl_char when possible [](cls &t, cl_int val) { t.pcie.bus = (cl_char) val; }) .def_prop_rw("device", [](cls &t) { return t.pcie.device; }, // FIXME: Revert to cl_char when possible [](cls &t, cl_int val) { t.pcie.device = (cl_char) val; }) .def_prop_rw("function", [](cls &t) { return t.pcie.function; }, // FIXME: Revert to cl_char when possible [](cls &t, cl_int val) { t.pcie.function = (cl_char) val; }) ; } #endif // }}} } // vim: foldmethod=marker pyopencl-2025.1/src/wrap_helpers.hpp0000644000000000000000000001473214332717401014366 0ustar00// Wrapper-helping odds and ends // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. 
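// A minimal, hypothetical use site for the helper macros defined below; all
// DEF_* and PYOPENCL_EXPOSE_* expansions assume a typedef named 'cls' in
// scope, mirroring the Sampler binding elsewhere in this package:
//
//   {
//     typedef sampler cls;
//     py::class_<cls>(m, "Sampler")
//       .DEF_SIMPLE_METHOD(get_info)  // expands to: def("get_info", &cls::get_info)
//       PYOPENCL_EXPOSE_EQUALITY_TESTS
//       PYOPENCL_EXPOSE_TO_FROM_INT_PTR(cl_sampler)
//       ;
//   }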
#ifndef PYCUDA_WRAP_HELPERS_HEADER_SEEN #define PYCUDA_WRAP_HELPERS_HEADER_SEEN #include #include #include #include #include namespace py = nanobind; #define ENUM_VALUE(NAME) \ value(#NAME, NAME) // {{{ DEF_SIMPLE_XXX #define DEF_SIMPLE_METHOD(NAME) \ def(#NAME, &cls::NAME) #define DEF_SIMPLE_STATIC_METHOD(NAME) \ def_static(#NAME, &cls::NAME) #define DEF_SIMPLE_METHOD_WITH_ARGS(NAME, ARGS) \ def(#NAME, &cls::NAME, boost::python::args ARGS) #define DEF_SIMPLE_FUNCTION(NAME) \ m.def(#NAME, &NAME) #define DEF_SIMPLE_FUNCTION_WITH_ARGS(NAME, ARGS) \ m.def(#NAME, &NAME, py::args ARGS) #define DEF_SIMPLE_RO_MEMBER(NAME) \ def_readonly(#NAME, &cls::m_##NAME) #define DEF_SIMPLE_RW_MEMBER(NAME) \ def_readwrite(#NAME, &cls::m_##NAME) // }}} // {{{ COPY_PY_XXX #define COPY_PY_LIST(TYPE, NAME) \ { \ for (auto it: py_##NAME) \ NAME.push_back(py::cast(it)); \ } #define COPY_PY_ARRAY(FUNC_NAME, TYPE, NAME, COUNTER) \ { \ COUNTER = 0; \ for (auto it: py_##NAME) \ { \ if (COUNTER == NAME.size()) \ throw error(FUNC_NAME, \ CL_INVALID_VALUE, "too many entries in " #NAME " argument"); \ NAME[COUNTER++] = py::cast(it); \ } \ } #define COPY_PY_COORD_TRIPLE(NAME) \ size_t NAME[3] = {0, 0, 0}; \ { \ py::sequence py_seq_##NAME = py::cast(py_##NAME); \ size_t my_len = len(py_seq_##NAME); \ if (my_len > 3) \ throw error("transfer", CL_INVALID_VALUE, #NAME "has too many components"); \ for (size_t i = 0; i < my_len; ++i) \ NAME[i] = py::cast(py_seq_##NAME[i]); \ } #define COPY_PY_PITCH_TUPLE(NAME) \ size_t NAME[2] = {0, 0}; \ if (py_##NAME.ptr() != Py_None) \ { \ py::sequence py_seq_##NAME = py::cast(py_##NAME); \ size_t my_len = len(py_seq_##NAME); \ if (my_len > 2) \ throw error("transfer", CL_INVALID_VALUE, #NAME "has too many components"); \ for (size_t i = 0; i < my_len; ++i) \ NAME[i] = py::cast(py_seq_##NAME[i]); \ } #define COPY_PY_REGION_TRIPLE(NAME) \ size_t NAME[3] = {1, 1, 1}; \ { \ py::sequence py_seq_##NAME = py::cast(py_##NAME); \ size_t my_len = len(py_seq_##NAME); \ if (my_len > 3) \ throw error("transfer", CL_INVALID_VALUE, #NAME "has too many components"); \ for (size_t i = 0; i < my_len; ++i) \ NAME[i] = py::cast(py_seq_##NAME[i]); \ } // }}} #define PYOPENCL_PARSE_NUMPY_ARRAY_SPEC \ PyArray_Descr *tp_descr; \ if (PyArray_DescrConverter(dtype.ptr(), &tp_descr) != NPY_SUCCEED) \ throw py::python_error(); \ \ std::vector shape; \ try \ { \ shape.push_back(py::cast(py_shape)); \ } \ catch (py::cast_error &) \ { \ COPY_PY_LIST(npy_intp, shape); \ } \ \ NPY_ORDER order = NPY_CORDER; \ PyArray_OrderConverter(py_order.ptr(), &order); \ \ int ary_flags = 0; \ if (order == NPY_FORTRANORDER) \ ary_flags |= NPY_FARRAY; \ else if (order == NPY_CORDER) \ ary_flags |= NPY_CARRAY; \ else \ throw std::runtime_error("unrecognized order specifier"); \ \ std::vector strides; \ if (py_strides.ptr() != Py_None) \ { \ COPY_PY_LIST(npy_intp, strides); \ } #define PYOPENCL_RETURN_VECTOR(ITEMTYPE, NAME) \ { \ py::list pyopencl_result; \ for (ITEMTYPE item: NAME) \ pyopencl_result.append(item); \ return pyopencl_result; \ } namespace { template inline py::object handle_from_new_ptr(T *ptr) { return py::cast(ptr, py::rv_policy::take_ownership); } template inline T *from_int_ptr(intptr_t obj_ref, bool retain) { ClType clobj = (ClType) obj_ref; return new T(clobj, retain); } template inline intptr_t to_int_ptr(T const &obj) { return (intptr_t) obj.data(); } } #define PYOPENCL_EXPOSE_TO_FROM_INT_PTR(CL_TYPENAME) \ .def_static("from_int_ptr", from_int_ptr, \ py::arg("int_ptr_value"), \ py::arg("retain")=true, \ "(static 
method) Return a new Python object referencing the C-level " \ ":c:type:`" #CL_TYPENAME "` object at the location pointed to " \ "by *int_ptr_value*. The relevant ``clRetain*`` function " \ "will be called if *retain* is True." \ "If the previous owner of the object will *not* release the reference, " \ "*retain* should be set to *False*, to effectively transfer ownership to " \ ":mod:`pyopencl`." \ "\n\n.. versionadded:: 2013.2\n" \ "\n\n.. versionchanged:: 2016.1\n\n *retain* added.") \ .def_prop_ro("int_ptr", to_int_ptr, \ "Return an integer corresponding to the pointer value " \ "of the underlying :c:type:`" #CL_TYPENAME "`. " \ "Use :meth:`from_int_ptr` to turn back into a Python object." \ "\n\n.. versionadded:: 2013.2\n") \ #define PYOPENCL_EXPOSE_EQUALITY_TESTS \ /* this relies on nanobind overload resolution going in order of registration */ \ .def("__eq__", [](cls const &self, cls const &other) { return self == other; }) \ .def("__eq__", [](cls const &self, py::object obj) { return false; }, py::arg("obj").none()) #endif // vim: foldmethod=marker pyopencl-2025.1/src/wrap_mempool.cpp0000644000000000000000000004604114332717401014365 0ustar00// Wrap memory pool // // Copyright (C) 2009 Andreas Kloeckner // // Permission is hereby granted, free of charge, to any person // obtaining a copy of this software and associated documentation // files (the "Software"), to deal in the Software without // restriction, including without limitation the rights to use, // copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the // Software is furnished to do so, subject to the following // conditions: // // The above copyright notice and this permission notice shall be // included in all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR // OTHER DEALINGS IN THE SOFTWARE. // Gregor Thalhammer (on Apr 13, 2011) said it's necessary to import Python.h // first to prevent OS X from overriding a bunch of macros. (e.g. 
isspace) #include #define NO_IMPORT_ARRAY #define PY_ARRAY_UNIQUE_SYMBOL pyopencl_ARRAY_API #include #include #include "wrap_helpers.hpp" #include "wrap_cl.hpp" #include "mempool.hpp" #include "tools.hpp" namespace pyopencl { // {{{ test_allocator class test_allocator : public py::intrusive_base { public: typedef void *pointer_type; typedef size_t size_type; bool is_deferred() const { return false; } pointer_type allocate(size_type s) { return nullptr; } pointer_type hand_out_existing_block(pointer_type &&p) { return p; } ~test_allocator() { } void free(pointer_type &&p) { } void try_release_blocks() { } }; // }}} // {{{ buffer allocators class buffer_allocator_base : public py::intrusive_base { protected: py::ref m_context; cl_mem_flags m_flags; public: buffer_allocator_base(py::ref const &ctx, cl_mem_flags flags=CL_MEM_READ_WRITE) : m_context(ctx), m_flags(flags) { if (flags & (CL_MEM_USE_HOST_PTR | CL_MEM_COPY_HOST_PTR)) throw pyopencl::error("Allocator", CL_INVALID_VALUE, "cannot specify USE_HOST_PTR or COPY_HOST_PTR flags"); } buffer_allocator_base(buffer_allocator_base const &src) : m_context(src.m_context), m_flags(src.m_flags) { } virtual ~buffer_allocator_base() { } typedef cl_mem pointer_type; typedef size_t size_type; virtual bool is_deferred() const = 0; virtual pointer_type allocate(size_type s) = 0; pointer_type hand_out_existing_block(pointer_type &&p) { return p; } void free(pointer_type &&p) { PYOPENCL_CALL_GUARDED(clReleaseMemObject, (p)); } void try_release_blocks() { pyopencl::run_python_gc(); } }; class deferred_buffer_allocator : public buffer_allocator_base { private: typedef buffer_allocator_base super; public: deferred_buffer_allocator(py::ref const &ctx, cl_mem_flags flags=CL_MEM_READ_WRITE) : super(ctx, flags) { } bool is_deferred() const { return true; } pointer_type allocate(size_type s) { if (s == 0) return nullptr; return pyopencl::create_buffer(m_context->data(), m_flags, s, 0); } }; class immediate_buffer_allocator : public buffer_allocator_base { private: typedef buffer_allocator_base super; pyopencl::command_queue m_queue; public: immediate_buffer_allocator(pyopencl::command_queue &queue, cl_mem_flags flags=CL_MEM_READ_WRITE) : super(queue.get_context(), flags), m_queue(queue.data(), /*retain*/ true) { } immediate_buffer_allocator(immediate_buffer_allocator const &src) : super(src), m_queue(src.m_queue) { } bool is_deferred() const { return false; } pointer_type allocate(size_type s) { if (s == 0) return nullptr; pointer_type ptr = pyopencl::create_buffer( m_context->data(), m_flags, s, 0); // Make sure the buffer gets allocated right here and right now. // This looks (and is) expensive. But immediate allocators // have their main use in memory pools, whose basic assumption // is that allocation is too expensive anyway--but they rely // on 'out-of-memory' being reported on allocation. (If it is // reported in a deferred manner, it has no way to react // (e.g. by freeing unused memory) because it is not part of // the call stack.) if (m_queue.get_hex_device_version() < 0x1020) { unsigned zero = 0; PYOPENCL_CALL_GUARDED(clEnqueueWriteBuffer, ( m_queue.data(), ptr, /* is blocking */ CL_FALSE, 0, std::min(s, sizeof(zero)), &zero, 0, NULL, NULL )); } else { PYOPENCL_CALL_GUARDED(clEnqueueMigrateMemObjects, ( m_queue.data(), 1, &ptr, CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED, 0, NULL, NULL )); } // No need to wait for completion here. clWaitForEvents (e.g.) // cannot return mem object allocation failures. 
This implies that // the buffer is faulted onto the device on enqueue. return ptr; } }; // }}} // {{{ pooled_buffer class pooled_buffer : public pyopencl::pooled_allocation >, public pyopencl::memory_object_holder { private: typedef pyopencl::pooled_allocation > super; public: pooled_buffer( py::ref p, super::size_type s) : super(p, s) { } virtual ~pooled_buffer() { } const super::pointer_type data() const { return m_ptr; } size_t size() const { return m_size; } // This shouldn't be necessary, but somehow nanobind gets unhappy if // it's not there. void free() { super::free(); } }; // }}} // {{{ allocate_from_buffer_allocator inline buffer *allocate_from_buffer_allocator(buffer_allocator_base &alloc, size_t size) { cl_mem mem = nullptr; int try_count = 0; while (try_count < 2) { try { mem = alloc.allocate(size); break; } catch (pyopencl::error &e) { if (!e.is_out_of_memory()) throw; if (++try_count == 2) throw; } alloc.try_release_blocks(); } if (!mem) { if (size == 0) return nullptr; else throw pyopencl::error("Allocator", CL_INVALID_VALUE, "allocator succeeded but returned NULL cl_mem"); } try { return new pyopencl::buffer(mem, false); } catch (...) { PYOPENCL_CALL_GUARDED(clReleaseMemObject, (mem)); throw; } } // }}} // {{{ allocate_from_buffer_pool pooled_buffer *allocate_from_buffer_pool( py::ref > pool, memory_pool::size_type sz) { return new pooled_buffer(pool, sz); } // }}} #if PYOPENCL_CL_VERSION >= 0x2000 struct svm_held_pointer { void *ptr; pyopencl::command_queue_ref queue; }; // {{{ svm allocator class svm_allocator : public py::intrusive_base { public: typedef svm_held_pointer pointer_type; typedef size_t size_type; protected: py::ref m_context; cl_uint m_alignment; cl_svm_mem_flags m_flags; pyopencl::command_queue_ref m_queue; public: svm_allocator(py::ref const &ctx, cl_uint alignment=0, cl_svm_mem_flags flags=CL_MEM_READ_WRITE, pyopencl::command_queue *queue=nullptr) : m_context(ctx), m_alignment(alignment), m_flags(flags) { if (queue) m_queue.set(queue->data()); } svm_allocator(svm_allocator const &src) : m_context(src.m_context), m_alignment(src.m_alignment), m_flags(src.m_flags) { } ~svm_allocator() { } bool is_deferred() const { // According to experiments with the Nvidia implementation (and based // on my reading of the CL spec), clSVMalloc will return an error // immediately upon being out of memory. Therefore the // immediate/deferred split on the buffer side is not needed here. // -AK, 2022-09-07 return false; } py::ref context() const { return m_context; } pointer_type allocate(size_type size) { if (size == 0) return { nullptr, nullptr }; PYOPENCL_PRINT_CALL_TRACE("clSVMalloc"); return { clSVMAlloc(m_context->data(), m_flags, size, m_alignment), pyopencl::command_queue_ref(m_queue.is_valid() ? 
m_queue.data() : nullptr) }; } pointer_type hand_out_existing_block(pointer_type &&p) { if (m_queue.is_valid()) { if (p.queue.is_valid()) { if (p.queue.data() != m_queue.data()) { // make sure synchronization promises stay valid in new queue cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarker, (p.queue.data(), &evt)); PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList, (m_queue.data(), 1, &evt, nullptr)); } } p.queue.set(m_queue.data()); } else { if (p.queue.is_valid()) { PYOPENCL_CALL_GUARDED_THREADED(clFinish, (p.queue.data())); p.queue.reset(); } } return std::move(p); } void free(pointer_type &&p) { if (p.queue.is_valid()) { PYOPENCL_CALL_GUARDED_CLEANUP(clEnqueueSVMFree, ( p.queue.data(), 1, &p.ptr, nullptr, nullptr, 0, nullptr, nullptr)); p.queue.reset(); } else { PYOPENCL_PRINT_CALL_TRACE("clSVMFree"); clSVMFree(m_context->data(), p.ptr); } } void try_release_blocks() { pyopencl::run_python_gc(); } }; // }}} // {{{ pooled_svm class pooled_svm : public pyopencl::pooled_allocation>, public pyopencl::svm_pointer { private: typedef pyopencl::pooled_allocation> super; public: pooled_svm( py::ref p, super::size_type s) : super(p, s) { } virtual ~pooled_svm() { } void *svm_ptr() const { return m_ptr.ptr; } size_t size() const { return m_size; } void bind_to_queue(pyopencl::command_queue const &queue) { if (pyopencl::is_queue_out_of_order(queue.data())) throw pyopencl::error("PooledSVM.bind_to_queue", CL_INVALID_VALUE, "supplying an out-of-order queue to SVMAllocation is invalid"); if (m_ptr.queue.is_valid()) { if (m_ptr.queue.data() != queue.data()) { // make sure synchronization promises stay valid in new queue cl_event evt; PYOPENCL_CALL_GUARDED(clEnqueueMarker, (m_ptr.queue.data(), &evt)); PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList, (queue.data(), 1, &evt, nullptr)); } } m_ptr.queue.set(queue.data()); } void unbind_from_queue() { if (m_ptr.queue.is_valid()) PYOPENCL_CALL_GUARDED_THREADED(clFinish, (m_ptr.queue.data())); m_ptr.queue.reset(); } // only use for testing/diagnostic/debugging purposes! cl_command_queue queue() const { if (m_ptr.queue.is_valid()) return m_ptr.queue.data(); else return nullptr; } // This shouldn't be necessary, but somehow nanobind gets unhappy if // it's not there. 
void free() { super::free(); } }; // }}} // {{{ svm_allocator_call inline pyopencl::svm_allocation *svm_allocator_call(svm_allocator &alloc, size_t size) { int try_count = 0; while (true) { try { svm_held_pointer mem(alloc.allocate(size)); if (mem.queue.is_valid()) return new pyopencl::svm_allocation( alloc.context(), mem.ptr, size, mem.queue.data()); else return new pyopencl::svm_allocation( alloc.context(), mem.ptr, size, nullptr); } catch (pyopencl::error &e) { if (!e.is_out_of_memory()) throw; if (++try_count == 2) throw; } alloc.try_release_blocks(); } } // }}} // {{{ allocate_from_svm_pool pooled_svm *allocate_from_svm_pool( py::ref > pool, pyopencl::memory_pool::size_type sz) { return new pooled_svm(pool, sz); } // }}} #endif } namespace { template void expose_memory_pool(Wrapper &wrapper) { typedef typename Wrapper::Type cls; wrapper .def_prop_ro("held_blocks", &cls::held_blocks) .def_prop_ro("active_blocks", &cls::active_blocks) .def_prop_ro("managed_bytes", &cls::managed_bytes) .def_prop_ro("active_bytes", &cls::active_bytes) .DEF_SIMPLE_METHOD(bin_number) .DEF_SIMPLE_METHOD(alloc_size) .DEF_SIMPLE_METHOD(free_held) .DEF_SIMPLE_METHOD(stop_holding) // undoc for now .def("_set_trace", &cls::set_trace) ; } } void pyopencl_expose_mempool(py::module_ &m) { m.def("bitlog2", pyopencl::bitlog2); { typedef pyopencl::buffer_allocator_base cls; py::class_ wrapper( m, "AllocatorBase", py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ); wrapper .def("__call__", pyopencl::allocate_from_buffer_allocator, py::arg("size")) ; } { typedef pyopencl::memory_pool cls; py::class_ wrapper( m, "_TestMemoryPool", py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ); wrapper .def("__init__", [](cls *self, unsigned leading_bits_in_bin_id) { new (self) cls( py::ref( new pyopencl::test_allocator()), leading_bits_in_bin_id); }, py::arg("leading_bits_in_bin_id")=4 ) .def("allocate", [](py::ref pool, cls::size_type sz) { pool->allocate(sz); return py::none(); }) ; expose_memory_pool(wrapper); } { typedef pyopencl::deferred_buffer_allocator cls; py::class_ wrapper( m, "DeferredAllocator"); wrapper .def(py::init const &>()) .def(py::init< py::ref const &, cl_mem_flags>(), py::arg("queue"), py::arg("mem_flags")) ; } { typedef pyopencl::immediate_buffer_allocator cls; py::class_ wrapper( m, "ImmediateAllocator"); wrapper .def(py::init()) .def(py::init(), py::arg("queue"), py::arg("mem_flags")) ; } { typedef pyopencl::pooled_buffer cls; py::class_(m, "PooledBuffer") .def("release", &cls::free) .def("bind_to_queue", [](cls &self, pyopencl::command_queue &queue) { /* no-op */ }) .def("unbind_from_queue", [](cls &self) { /* no-op */ }) ; } { typedef pyopencl::memory_pool cls; py::class_ wrapper( m, "MemoryPool", py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ); wrapper .def(py::init, unsigned>(), py::arg("allocator"), py::arg("leading_bits_in_bin_id")=4 ) .def("allocate", pyopencl::allocate_from_buffer_pool, py::arg("size")) .def("__call__", pyopencl::allocate_from_buffer_pool, py::arg("size")) ; expose_memory_pool(wrapper); } #if PYOPENCL_CL_VERSION >= 0x2000 { typedef pyopencl::svm_allocator cls; py::class_ wrapper( m, "SVMAllocator", py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ); wrapper .def(py::init const &, cl_uint, cl_uint, pyopencl::command_queue *>(), py::arg("context"), /* py::kw_only(), */ py::arg("alignment")=0, py::arg("flags")=CL_MEM_READ_WRITE, py::arg("queue").none(true)=nullptr ) 
.def("__call__", pyopencl::svm_allocator_call, py::arg("size")) ; } { typedef pyopencl::pooled_svm cls; py::class_(m, "PooledSVM") .def("release", &cls::free) .def("enqueue_release", &cls::free) .def("__eq__", [](const cls &self, const cls &other) { return self.svm_ptr() == other.svm_ptr(); }) .def("__hash__", [](cls &self) { return (intptr_t) self.svm_ptr(); }) .DEF_SIMPLE_METHOD(bind_to_queue) .DEF_SIMPLE_METHOD(unbind_from_queue) // only for diagnostic/debugging/testing purposes! .def_prop_ro("_queue", [](cls const &self) -> py::object { cl_command_queue queue = self.queue(); if (queue) return py::cast(new pyopencl::command_queue(queue, true)); else return py::none(); }) ; } { typedef pyopencl::memory_pool cls; py::class_ wrapper( m, "SVMPool", py::intrusive_ptr( [](cls *o, PyObject *po) noexcept { o->set_self_py(po); }) ); wrapper .def(py::init, unsigned>(), py::arg("allocator"), /* py::kw_only(), */ py::arg("leading_bits_in_bin_id")=4 ) .def("__call__", pyopencl::allocate_from_svm_pool, py::arg("size")) ; expose_memory_pool(wrapper); } #endif } // vim: foldmethod=marker pyopencl-2025.1/test/add-vectors-32.spv0000644000000000000000000000143014332717401014530 0ustar00# OpenCL.std sum@  __spirv_BuiltInGlobalInvocationId a_g b_g res_gentrycallarrayidxarrayidx1addarrayidx2G&IG GG )__spirv_BuiltInGlobalInvocationIdJ    ! ;6 7 7 7 =QF =F =F >8pyopencl-2025.1/test/add-vectors-64.spv0000644000000000000000000000167014332717401014543 0ustar00# OpenCL.std sum@  __spirv_BuiltInGlobalInvocationId a_g b_g res_gentrycallconvidxpromarrayidxidxprom1arrayidx2addidxprom3arrayidx4G&IG GG )__spirv_BuiltInGlobalInvocationIdJ @   ! ;6 7 7 7 =QqrF =rF =rF >8pyopencl-2025.1/test/empty-header.h0000644000000000000000000000003314332717401014074 0ustar00/* what did you expect? */ pyopencl-2025.1/test/test_algorithm.py0000644000000000000000000010503514332717401014746 0ustar00__copyright__ = "Copyright (C) 2013 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import sys import numpy as np import numpy.linalg as la import pytest from test_array import general_clrand from pytools import memoize import pyopencl as cl import pyopencl.array from pyopencl.characterize import ( get_pocl_version, has_double_support, has_struct_arg_count_bug, ) from pyopencl.scan import ( ExclusiveScanKernel, GenericDebugScanKernel, GenericScanKernel, InclusiveScanKernel, ) from pyopencl.tools import ( pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) # {{{ elementwise def test_elwise_kernel(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand a_gpu = clrand(queue, (50,), np.float32) b_gpu = clrand(queue, (50,), np.float32) from pyopencl.elementwise import ElementwiseKernel lin_comb = ElementwiseKernel(context, "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]", "linear_combination") c_gpu = cl.array.empty_like(a_gpu) lin_comb(5, a_gpu, 6, b_gpu, c_gpu) assert la.norm((c_gpu - (5 * a_gpu + 6 * b_gpu)).get()) < 1e-5 def test_elwise_kernel_with_options(ctx_factory): from pyopencl.clrandom import rand as clrand from pyopencl.elementwise import ElementwiseKernel context = ctx_factory() queue = cl.CommandQueue(context) in_gpu = clrand(queue, (50,), np.float32) options = ["-D", "ADD_ONE"] add_one = ElementwiseKernel( context, "float* out, const float *in", """ out[i] = in[i] #ifdef ADD_ONE +1 #endif ; """, options=options, ) out_gpu = cl.array.empty_like(in_gpu) add_one(out_gpu, in_gpu) gt = in_gpu.get() + 1 gv = out_gpu.get() assert la.norm(gv - gt) < 1e-5 def test_ranged_elwise_kernel(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.elementwise import ElementwiseKernel set_to_seven = ElementwiseKernel(context, "float *z", "z[i] = 7", "set_to_seven") for _i, slc in enumerate([ slice(5, 20000), slice(5, 20000, 17), slice(3000, 5, -1), slice(1000, -1), ]): a_gpu = cl.array.zeros(queue, (50000,), dtype=np.float32) a_cpu = np.zeros(a_gpu.shape, a_gpu.dtype) a_cpu[slc] = 7 set_to_seven(a_gpu, slice=slc) assert (a_cpu == a_gpu.get()).all() def test_take(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) idx = cl.array.arange(queue, 0, 200000, 2, dtype=np.uint32) a = cl.array.arange(queue, 0, 600000, 3, dtype=np.float32) result = cl.array.take(a, idx) assert ((3 * idx).get() == result.get()).all() def test_arange(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) n = 5000 a = cl.array.arange(queue, n, dtype=np.float32) assert (np.arange(n, dtype=np.float32) == a.get()).all() def test_reverse(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) n = 5000 a = np.arange(n).astype(np.float32) a_gpu = cl.array.to_device(queue, a) a_gpu = a_gpu.reverse() assert (a[::-1] == a_gpu.get()).all() def test_if_positive(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand ary_len = 20000 a_gpu = clrand(queue, (ary_len,), np.float32) b_gpu = clrand(queue, (ary_len,), np.float32) a = a_gpu.get() b = b_gpu.get() max_a_b_gpu = cl.array.maximum(a_gpu, b_gpu) min_a_b_gpu = cl.array.minimum(a_gpu, b_gpu) print(max_a_b_gpu) print(np.maximum(a, b)) assert la.norm(max_a_b_gpu.get() - np.maximum(a, b)) == 0 assert la.norm(min_a_b_gpu.get() - np.minimum(a, b)) == 0 def test_take_put(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for n in [5, 17, 333]: one_field_size = 8 buf_gpu = cl.array.zeros(queue, n * 
one_field_size, dtype=np.float32) dest_indices = cl.array.to_device(queue, np.array([0, 1, 2, 3, 32, 33, 34, 35], dtype=np.uint32)) read_map = cl.array.to_device(queue, np.array([7, 6, 5, 4, 3, 2, 1, 0], dtype=np.uint32)) cl.array.multi_take_put( arrays=[buf_gpu for i in range(n)], dest_indices=dest_indices, src_indices=read_map, src_offsets=[i * one_field_size for i in range(n)], dest_shape=(96,)) def test_astype(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand if not has_double_support(context.devices[0]): from pytest import skip skip("double precision not supported on %s" % context.devices[0]) a_gpu = clrand(queue, (2000,), dtype=np.float32) a = a_gpu.get().astype(np.float64) a2 = a_gpu.astype(np.float64).get() assert a2.dtype == np.float64 assert la.norm(a - a2) == 0, (a, a2) a_gpu = clrand(queue, (2000,), dtype=np.float64) a = a_gpu.get().astype(np.float32) a2 = a_gpu.astype(np.float32).get() assert a2.dtype == np.float32 assert la.norm(a - a2) / la.norm(a) < 1e-7 # }}} # {{{ reduction def test_sum(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) n = 200000 for dtype in [np.float32, np.complex64]: a_gpu = general_clrand(queue, (n,), dtype) a = a_gpu.get() for slc in [ slice(None), slice(1000, 3000), slice(1000, -3000), slice(1000, None), slice(1000, None, 3), slice(1000, 1000), ]: sum_a = np.sum(a[slc]) if sum_a: ref_divisor = abs(sum_a) else: ref_divisor = 1 if slc.step is None: sum_a_gpu = cl.array.sum(a_gpu[slc]).get() assert abs(sum_a_gpu - sum_a) / ref_divisor < 1e-4 sum_a_gpu_2 = cl.array.sum(a_gpu, slice=slc).get() assert abs(sum_a_gpu_2 - sum_a) / ref_divisor < 1e-4 def test_sum_without_data(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) n = 2000 from pyopencl.reduction import ReductionKernel red = ReductionKernel(context, np.int32, neutral="0", reduce_expr="a+b", map_expr="i", arguments=[]) result_dev = red(range=slice(n), queue=queue).get() result_ref = n*(n-1)//2 assert result_dev == result_ref def test_reduction_not_first_argument(ctx_factory): # https://github.com/inducer/pyopencl/issues/535 from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) n = 400 a = cl.array.arange(queue, n, dtype=np.float32) b = cl.array.arange(queue, n, dtype=np.float32) from pyopencl.reduction import ReductionKernel krnl = ReductionKernel(context, np.float32, neutral="0", reduce_expr="a+b", map_expr="z*x[i]*y[i]", arguments="float z, __global float *x, __global float *y") my_dot_prod = krnl(0.1, a, b).get() assert abs(my_dot_prod - 0.1*np.sum(np.arange(n)**2)) < 1e-4 def test_minmax(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand if has_double_support(context.devices[0]): dtypes = [np.float64, np.float32, np.int32] else: dtypes = [np.float32, np.int32] for what in ["min", "max"]: for dtype in dtypes: a_gpu = clrand(queue, (200000,), dtype) a = a_gpu.get() op_a = getattr(np, what)(a) op_a_gpu = getattr(cl.array, what)(a_gpu).get() assert op_a_gpu == op_a, (op_a_gpu, op_a, dtype, what) def test_subset_minmax(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand l_a = 200000 gran = 5 l_m = l_a - 
l_a // gran + 1 if has_double_support(context.devices[0]): dtypes = [np.float64, np.float32, np.int32] else: dtypes = [np.float32, np.int32] for dtype in dtypes: a_gpu = clrand(queue, (l_a,), dtype) a = a_gpu.get() meaningful_indices_gpu = cl.array.zeros( queue, l_m, dtype=np.int32) meaningful_indices = meaningful_indices_gpu.get() j = 0 for i in range(len(meaningful_indices)): meaningful_indices[i] = j j = j + 1 if j % gran == 0: j = j + 1 meaningful_indices_gpu = cl.array.to_device( queue, meaningful_indices) b = a[meaningful_indices] min_a = np.min(b) min_a_gpu = cl.array.subset_min(meaningful_indices_gpu, a_gpu).get() assert min_a_gpu == min_a def test_dot(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) dev = context.devices[0] dtypes = [np.float32, np.complex64] if has_double_support(dev): if has_struct_arg_count_bug(dev) == "apple": dtypes.extend([np.float64]) else: dtypes.extend([np.float64, np.complex128]) for a_dtype in dtypes: for b_dtype in dtypes: print(a_dtype, b_dtype) a_gpu = general_clrand(queue, (200000,), a_dtype) a = a_gpu.get() b_gpu = general_clrand(queue, (200000,), b_dtype) b = b_gpu.get() dot_ab = np.dot(a, b) dot_ab_gpu = cl.array.dot(a_gpu, b_gpu).get() assert abs(dot_ab_gpu - dot_ab) / abs(dot_ab) < 1e-4 try: vdot_ab = np.vdot(a, b) except NotImplementedError: import sys is_pypy = "__pypy__" in sys.builtin_module_names if is_pypy: print("PYPY: VDOT UNIMPLEMENTED") continue else: raise vdot_ab_gpu = cl.array.vdot(a_gpu, b_gpu).get() rel_err = abs(vdot_ab_gpu - vdot_ab) / abs(vdot_ab) assert rel_err < 1e-4, rel_err @memoize def make_mmc_dtype(device): dtype = np.dtype([ ("cur_min", np.int32), ("cur_max", np.int32), ("pad", np.int32), ]) name = "minmax_collector" from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct dtype, c_decl = match_dtype_to_c_struct(device, name, dtype) dtype = get_or_register_dtype(name, dtype) return dtype, c_decl def test_struct_reduce(ctx_factory): pytest.importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) dev, = context.devices if (dev.vendor == "NVIDIA" and dev.platform.vendor == "Apple" and dev.driver_version == "8.12.47 310.40.00.05f01"): pytest.skip("causes a compiler hang on Apple/Nv GPU") mmc_dtype, mmc_c_decl = make_mmc_dtype(context.devices[0]) preamble = mmc_c_decl + r"""//CL// minmax_collector mmc_neutral() { // FIXME: needs infinity literal in real use, ok here minmax_collector result; result.cur_min = 1<<30; result.cur_max = -(1<<30); return result; } minmax_collector mmc_from_scalar(float x) { minmax_collector result; result.cur_min = x; result.cur_max = x; return result; } minmax_collector agg_mmc(minmax_collector a, minmax_collector b) { minmax_collector result = a; if (b.cur_min < result.cur_min) result.cur_min = b.cur_min; if (b.cur_max > result.cur_max) result.cur_max = b.cur_max; return result; } """ from pyopencl.clrandom import rand as clrand a_gpu = clrand(queue, (20000,), dtype=np.int32, a=0, b=10**6) a = a_gpu.get() from pyopencl.reduction import ReductionKernel red = ReductionKernel(context, mmc_dtype, neutral="mmc_neutral()", reduce_expr="agg_mmc(a, b)", map_expr="mmc_from_scalar(x[i])", arguments="__global int *x", preamble=preamble) minmax = red(a_gpu).get() # print(minmax["cur_min"], minmax["cur_max"]) # print(np.min(a), np.max(a)) assert abs(minmax["cur_min"] - np.min(a)) < 1e-5 assert abs(minmax["cur_max"] - np.max(a)) < 1e-5 # }}} # {{{ scan-related def summarize_error(obtained, desired, 
orig, thresh=1e-5): from pytest import importorskip importorskip("mako") err = obtained - desired ok_count = 0 bad_count = 0 bad_limit = 200 def summarize_counts(): if ok_count: entries.append("<%d ok>" % ok_count) if bad_count >= bad_limit: entries.append("<%d more bad>" % (bad_count-bad_limit)) entries = [] for i, val in enumerate(err): if abs(val) > thresh: if ok_count: summarize_counts() ok_count = 0 bad_count += 1 if bad_count < bad_limit: entries.append("{!r} (want: {!r}, got: {!r}, orig: {!r})".format( obtained[i], desired[i], obtained[i], orig[i])) else: if bad_count: summarize_counts() bad_count = 0 ok_count += 1 summarize_counts() return " ".join(entries) scan_test_counts = [ 10, 2 ** 8 - 1, 2 ** 8, 2 ** 8 + 1, 2 ** 10 - 5, 2 ** 10, 2 ** 10 + 5, 2 ** 12 - 5, 2 ** 12, 2 ** 12 + 5, 2 ** 20 - 2 ** 18, 2 ** 20 - 2 ** 18 + 5, 2 ** 20 + 1, 2 ** 20, 2 ** 23 + 3, # larger sizes cause out of memory on low-end AMD APUs ] @pytest.mark.parametrize("dtype", [np.int32, np.int64]) @pytest.mark.parametrize("scan_cls", [InclusiveScanKernel, ExclusiveScanKernel]) def test_scan(ctx_factory, dtype, scan_cls): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) knl = scan_cls(context, dtype, "a+b", "0") rng = np.random.default_rng(seed=42) for n in scan_test_counts: host_data = rng.integers(0, 10, n, dtype=dtype) dev_data = cl.array.to_device(queue, host_data) # /!\ fails on Nv GT2?? for some drivers assert (host_data == dev_data.get()).all() knl(dev_data) desired_result = np.cumsum(host_data, axis=0) if scan_cls is ExclusiveScanKernel: desired_result -= host_data is_ok = (dev_data.get() == desired_result).all() if 1 and not is_ok: print("something went wrong, summarizing error...") print(summarize_error(dev_data.get(), desired_result, host_data)) print("dtype:%s n:%d %s worked:%s" % (dtype, n, scan_cls, is_ok)) assert is_ok from gc import collect collect() @pytest.mark.parametrize("scan_cls", (GenericScanKernel, GenericDebugScanKernel)) def test_scan_with_vectorargs_with_offsets(ctx_factory, scan_cls): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.tools import VectorArg knl = scan_cls( context, float, arguments=[ VectorArg(float, "input", with_offset=True), VectorArg(int, "segment", with_offset=True), ], input_expr="input[i]", is_segment_start_expr="segment[i]", scan_expr="a+b", neutral="0", output_statement=""" input[i] = item; """) n = 20 rng = np.random.default_rng(seed=42) host_data = rng.integers(0, 10, n).astype(np.float64) dev_data = cl.array.to_device(queue, host_data) segment_data = np.zeros(n, dtype=int) dev_segment_data = cl.array.to_device(queue, segment_data) knl(dev_data, dev_segment_data) assert (dev_data.get() == np.cumsum(host_data)).all() def test_copy_if(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand for n in scan_test_counts: a_dev = clrand(queue, (n,), dtype=np.int32, a=0, b=1000) a = a_dev.get() from pyopencl.algorithm import copy_if crit = a_dev.dtype.type(300) selected = a[a > crit] selected_dev, count_dev, _evt = copy_if( a_dev, "ary[i] > myval", [("myval", crit)]) assert (selected_dev.get()[:count_dev.get()] == selected).all() from gc import collect collect() def test_partition(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand for n in scan_test_counts: 
print("part", n) a_dev = clrand(queue, (n,), dtype=np.int32, a=0, b=1000) a = a_dev.get() crit = a_dev.dtype.type(300) true_host = a[a > crit] false_host = a[a <= crit] from pyopencl.algorithm import partition true_dev, false_dev, count_true_dev, _evt = partition( a_dev, "ary[i] > myval", [("myval", crit)]) count_true_dev = count_true_dev.get() assert (true_dev.get()[:count_true_dev] == true_host).all() assert (false_dev.get()[:n-count_true_dev] == false_host).all() def test_unique(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand for n in scan_test_counts: a_dev = clrand(queue, (n,), dtype=np.int32, a=0, b=1000) a = a_dev.get() a = np.sort(a) a_dev = cl.array.to_device(queue, a) a_unique_host = np.unique(a) from pyopencl.algorithm import unique a_unique_dev, count_unique_dev, _evt = unique(a_dev) count_unique_dev = count_unique_dev.get() assert (a_unique_dev.get()[:count_unique_dev] == a_unique_host).all() from gc import collect collect() def test_index_preservation(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) classes = [GenericScanKernel] dev = context.devices[0] if dev.type & cl.device_type.CPU: classes.append(GenericDebugScanKernel) for cls in classes: for n in scan_test_counts: knl = cls( context, np.int32, arguments="__global int *out", input_expr="i", scan_expr="b", neutral="0", output_statement=""" out[i] = item; """) out = cl.array.empty(queue, n, dtype=np.int32) knl(out) assert (out.get() == np.arange(n)).all() from gc import collect collect() def test_segmented_scan(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.tools import dtype_to_ctype dtype = np.int32 ctype = dtype_to_ctype(dtype) # for is_exclusive in [False, True]: for is_exclusive in [True, False]: if is_exclusive: output_statement = "out[i] = prev_item" else: output_statement = "out[i] = item" knl = GenericScanKernel(context, dtype, arguments="__global %s *ary, __global char *segflags, " "__global %s *out" % (ctype, ctype), input_expr="ary[i]", scan_expr="across_seg_boundary ? 
b : (a+b)", neutral="0", is_segment_start_expr="segflags[i]", output_statement=output_statement, options=[]) np.set_printoptions(threshold=2000) from random import randrange from pyopencl.clrandom import rand as clrand for n in scan_test_counts: a_dev = clrand(queue, (n,), dtype=dtype, a=0, b=10) a = a_dev.get() if 10 <= n < 20: seg_boundaries_values = [ [0, 9], [0, 3], [4, 6], ] else: seg_boundaries_values = [] for i in range(10): seg_boundary_count = max(2, min(100, randrange(0, int(0.4*n)))) seg_boundaries = [ randrange(n) for i in range(seg_boundary_count)] if n >= 1029: seg_boundaries.insert(0, 1028) seg_boundaries.sort() seg_boundaries_values.append(seg_boundaries) for seg_boundaries in seg_boundaries_values: # print("BOUNDARIES", seg_boundaries) # print(a) seg_boundary_flags = np.zeros(n, dtype=np.uint8) seg_boundary_flags[seg_boundaries] = 1 seg_boundary_flags_dev = cl.array.to_device( queue, seg_boundary_flags) seg_boundaries.insert(0, 0) result_host = a.copy() for i, seg_start in enumerate(seg_boundaries): if i+1 < len(seg_boundaries): seg_end = seg_boundaries[i+1] else: seg_end = None if is_exclusive: result_host[seg_start+1:seg_end] = np.cumsum( a[seg_start:seg_end][:-1]) result_host[seg_start] = 0 else: result_host[seg_start:seg_end] = np.cumsum( a[seg_start:seg_end]) # print("REF", result_host) result_dev = cl.array.empty_like(a_dev) knl(a_dev, seg_boundary_flags_dev, result_dev) # print("RES", result_dev) is_correct = (result_dev.get() == result_host).all() if not is_correct: diff = result_dev.get() - result_host print("RES-REF", diff) print("ERRWHERE", np.where(diff)) print(n, list(seg_boundaries)) assert is_correct from gc import collect collect() print("%d excl:%s done" % (n, is_exclusive)) @pytest.mark.parametrize("scan_kernel", [GenericScanKernel, GenericDebugScanKernel]) def test_sort(ctx_factory, scan_kernel): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) dtype = np.int32 from pyopencl.algorithm import RadixSort sort = RadixSort(context, "int *ary", key_expr="ary[i]", sort_arg_names=["ary"], scan_kernel=scan_kernel) from pyopencl.clrandom import PhiloxGenerator rng = PhiloxGenerator(context, seed=15) from time import time # intermediate arrays for largest size cause out-of-memory on low-end GPUs for n in scan_test_counts[:-1]: if n >= 2000 and isinstance(scan_kernel, GenericDebugScanKernel): continue print(n) print(" rng") a_dev = rng.uniform(queue, (n,), dtype=dtype, a=0, b=2**16) a = a_dev.get() dev_start = time() print(" device") (a_dev_sorted,), _evt = sort(a_dev, key_bits=16) queue.finish() dev_end = time() print(" numpy") a_sorted = np.sort(a) numpy_end = time() assert (a_dev_sorted.get() == a_sorted).all() numpy_elapsed = numpy_end-dev_end dev_elapsed = dev_end-dev_start # windows clock has really low resolution (16 milliseconds) and the # difference in time will end up at zero for smaller array sizes. 
if numpy_elapsed != 0 and dev_elapsed != 0: print( " dev: {:.2f} MKeys/s numpy: {:.2f} MKeys/s ratio: {:.2f}x".format( 1e-6*n/dev_elapsed, 1e-6*n/numpy_elapsed, numpy_elapsed/dev_elapsed)) def test_list_builder(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.algorithm import ListOfListsBuilder builder = ListOfListsBuilder(context, [("mylist", np.int32)], """//CL// void generate(LIST_ARG_DECL USER_ARG_DECL index_type i) { int count = i % 4; for (int j = 0; j < count; ++j) { APPEND_mylist(count); } } """, arg_decls=[]) result, _evt = builder(queue, 2000) inf = result["mylist"] assert inf.count == 3000 assert (inf.lists.get()[-6:] == [1, 2, 2, 3, 3, 3]).all() def test_list_builder_with_memoryobject(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.algorithm import ListOfListsBuilder from pyopencl.tools import VectorArg builder = ListOfListsBuilder(context, [("mylist", np.int32)], """//CL// void generate(LIST_ARG_DECL USER_ARG_DECL index_type i) { APPEND_mylist(input_list[i]); } """, arg_decls=[VectorArg(float, "input_list")]) n = 10000 input_list = cl.array.zeros(queue, (n,), float) result, _evt = builder(queue, n, input_list.data) inf = result["mylist"] assert inf.count == n assert (inf.lists.get() == 0).all() def test_list_builder_with_offset(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.algorithm import ListOfListsBuilder from pyopencl.tools import VectorArg builder = ListOfListsBuilder(context, [("mylist", np.int32)], """//CL// void generate(LIST_ARG_DECL USER_ARG_DECL index_type i) { APPEND_mylist(input_list[i]); } """, arg_decls=[ VectorArg(float, "input_list", with_offset=True)]) n = 10000 input_list = cl.array.zeros(queue, (n + 10,), float) input_list[10:] = 1 result, _evt = builder(queue, n, input_list[10:]) inf = result["mylist"] assert inf.count == n assert (inf.lists.get() == 1).all() def test_list_builder_with_empty_elim(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.algorithm import ListOfListsBuilder builder = ListOfListsBuilder( context, [("mylist1", np.int32), ("mylist2", np.int32), ("mylist3", np.int32)], """//CL// void generate(LIST_ARG_DECL USER_ARG_DECL index_type i) { if (i % 5 == 0) { for (int j = 0; j < i / 5; ++j) { APPEND_mylist1(j); APPEND_mylist2(j + 1); APPEND_mylist3(j); } } } """, arg_decls=[], eliminate_empty_output_lists=["mylist1", "mylist2"]) result, _evt = builder(queue, 1000) mylist1 = result["mylist1"] assert mylist1.count == 19900 assert (mylist1.starts.get()[:5] == [0, 1, 3, 6, 10]).all() assert (mylist1.nonempty_indices.get()[:5] == [5, 10, 15, 20, 25]).all() assert (mylist1.lists.get()[:6] == [0, 0, 1, 0, 1, 2]).all() mylist2 = result["mylist2"] assert mylist2.count == 19900 assert (mylist2.lists.get()[:6] == [1, 1, 2, 1, 2, 3]).all() mylist3 = result["mylist3"] assert mylist3.count == 19900 assert (mylist3.starts.get()[:10] == [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]).all() assert (mylist3.lists.get()[:6] == [0, 0, 1, 0, 1, 2]).all() def test_key_value_sorter(ctx_factory): from pytest import importorskip importorskip("mako") context = ctx_factory() queue = cl.CommandQueue(context) n = 10**5 nkeys = 2000 from pyopencl.clrandom import rand as clrand keys = clrand(queue, n, np.int32, b=nkeys) values = 
clrand(queue, n, np.int32, b=n).astype(np.int64) assert np.max(keys.get()) < nkeys from pyopencl.algorithm import KeyValueSorter kvs = KeyValueSorter(context) starts, lists, _evt = kvs(queue, keys, values, nkeys, starts_dtype=np.int32) starts = starts.get() lists = lists.get() mydict = {} for k, v in zip(keys.get(), values.get()): mydict.setdefault(k, []).append(v) for i in range(nkeys): start, end = starts[i:i+2] assert sorted(mydict[i]) == sorted(lists[start:end]) # }}} # {{{ bitonic sort @pytest.mark.parametrize("size", [ 512, 4, 16 ]) @pytest.mark.parametrize("dtype", [ np.int32, np.float32, np.float64 ]) @pytest.mark.bitonic def test_bitonic_sort(ctx_factory, size, dtype): ctx = ctx_factory() queue = cl.CommandQueue(ctx) dev = ctx.devices[0] if (dev.platform.name == "Apple" and dev.type & cl.device_type.CPU): pytest.xfail("Bitonic sort won't work on Apple CPU: no workgroup " "parallelism") if (dev.platform.name == "Portable Computing Language" and dtype == np.float64 and get_pocl_version(dev.platform) < (1, 0)): pytest.xfail("Double precision bitonic sort doesn't work on PoCL < 1.0") if dtype == np.float64 and not has_double_support(dev): from pytest import skip skip("double precision not supported on %s" % dev) # Requires https://github.com/intel/llvm/releases/tag/2022-WW50 or newer to pass # on Intel CL. import pyopencl.clrandom as clrandom from pyopencl.bitonic_sort import BitonicSort s = clrandom.rand(queue, (2, size, 3,), dtype, luxury=None, a=0, b=239482333) sgs = s.copy() # enqueue_marker crashes under CL 1.1 pocl if there is anything to wait for # (no clEnqueueWaitForEvents) https://github.com/inducer/pyopencl/pull/237 if (dev.platform.name == "Portable Computing Language" and cl.get_cl_header_version() < (1, 2)): sgs.finish() sorter = BitonicSort(ctx) sgs, _evt = sorter(sgs, axis=1) assert np.array_equal(np.sort(s.get(), axis=1), sgs.get()) @pytest.mark.parametrize("size", [ 0, 4, 2**14, 2**18, ]) @pytest.mark.parametrize("dtype", [ np.int32, np.float32, np.float64 ]) @pytest.mark.bitonic def test_bitonic_argsort(ctx_factory, size, dtype): import sys is_pypy = "__pypy__" in sys.builtin_module_names if not size and is_pypy: # https://bitbucket.org/pypy/numpy/issues/53/specifying-strides-on-zero-sized-array pytest.xfail("pypy doesn't seem to handle as_strided " "on zero-sized arrays very well") ctx = ctx_factory() queue = cl.CommandQueue(ctx) device = queue.device if device.platform.vendor == "The pocl project" \ and device.type & cl.device_type.GPU: pytest.xfail("bitonic argsort fails on PoCL + Nvidia," "at least the K40, as of PoCL 1.6, 2021-01-20") # Requires https://github.com/intel/llvm/releases/tag/2022-WW50 or newer to pass # on Intel CL. 
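    # Sidebar: the BitonicSort API that this test and test_bitonic_sort
    # above exercise, in isolation -- a sketch assuming ctx and the device
    # arrays already exist. The sorter sorts along one axis (the tests only
    # use power-of-two extents there); passing idx= makes it permute an
    # index array alongside the keys, which turns a sort into an argsort:
    #
    #     from pyopencl.bitonic_sort import BitonicSort
    #
    #     sorter = BitonicSort(ctx)
    #     sorted_arr, evt = sorter(arr.copy(), axis=1)
    #     sorted_keys, evt = sorter(keys.copy(), idx=index, axis=0)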
dev = ctx.devices[0] if (dev.platform.name == "Portable Computing Language" and sys.platform == "darwin"): pytest.xfail("Bitonic sort crashes on Apple PoCL") if (dev.platform.name == "Apple" and dev.type & cl.device_type.CPU): pytest.xfail("Bitonic sort won't work on Apple CPU: no workgroup " "parallelism") if (dev.platform.name == "Portable Computing Language" and dtype == np.float64 and get_pocl_version(dev.platform) < (1, 0)): pytest.xfail("Double precision bitonic sort doesn't work on PoCL < 1.0") if (dev.platform.name == "Intel(R) OpenCL" and size == 0): pytest.xfail("size-0 arange fails on Intel CL") if dtype == np.float64 and not has_double_support(dev): from pytest import skip skip("double precision not supported on %s" % dev) import pyopencl.clrandom as clrandom from pyopencl.bitonic_sort import BitonicSort index = cl.array.arange(queue, 0, size, 1, dtype=np.int32) m = clrandom.rand(queue, (size,), dtype, luxury=None, a=0, b=239432234) sorterm = BitonicSort(ctx) ms = m.copy() # enqueue_marker crashes under CL 1.1 PoCL if there is anything to wait for # (no clEnqueueWaitForEvents) https://github.com/inducer/pyopencl/pull/237 if (dev.platform.name == "Portable Computing Language" and cl.get_cl_header_version() < (1, 2)): ms.finish() index.finish() ms, _evt = sorterm(ms, idx=index, axis=0) assert np.array_equal(np.sort(m.get()), ms.get()) # may be False because of identical values in array # assert np.array_equal(np.argsort(m.get()), index.get()) # Check values by indices assert np.array_equal(m.get()[np.argsort(m.get())], m.get()[index.get()]) # }}} if __name__ == "__main__": if len(sys.argv) > 1: exec(sys.argv[1]) else: from pytest import main main([__file__]) # vim: filetype=pyopencl:fdm=marker pyopencl-2025.1/test/test_array.py0000644000000000000000000017644314332717401014111 0ustar00#! /usr/bin/env python __copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import operator import platform import sys from itertools import product import numpy as np import numpy.linalg as la import pytest import pyopencl as cl import pyopencl.array as cl_array import pyopencl.cltypes as cltypes import pyopencl.tools as cl_tools from pyopencl.characterize import has_double_support, has_struct_arg_count_bug from pyopencl.clrandom import PhiloxGenerator, ThreefryGenerator from pyopencl.tools import ( pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) _PYPY = cl._PYPY # {{{ helpers TO_REAL = { np.dtype(np.complex64): np.float32, np.dtype(np.complex128): np.float64 } def general_clrand(queue, shape, dtype): from pyopencl.clrandom import rand as clrand dtype = np.dtype(dtype) if dtype.kind == "c": real_dtype = dtype.type(0).real.dtype return clrand(queue, shape, real_dtype) + 1j*clrand(queue, shape, real_dtype) else: return clrand(queue, shape, dtype) def make_random_array(queue, dtype, size): from pyopencl.clrandom import rand dtype = np.dtype(dtype) if dtype.kind == "c": real_dtype = TO_REAL[dtype] return (rand(queue, shape=(size,), dtype=real_dtype).astype(dtype) + rand(queue, shape=(size,), dtype=real_dtype).astype(dtype) * dtype.type(1j)) else: return rand(queue, shape=(size,), dtype=dtype) # }}} # {{{ dtype-related # {{{ test_basic_complex def test_basic_complex(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand size = 500 ary = (rand(queue, shape=(size,), dtype=np.float32).astype(np.complex64) + rand(queue, shape=(size,), dtype=np.float32).astype(np.complex64) * 1j) assert ary.dtype != np.dtype(np.complex128) c = np.complex64(5+7j) host_ary = ary.get() assert la.norm((ary*c).get() - c*host_ary) < 1e-5 * la.norm(host_ary) # }}} # {{{ test_mix_complex def test_mix_complex(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) size = 10 dtypes = [ (np.float32, np.complex64), # (np.int32, np.complex64), ] dev = context.devices[0] if has_double_support(dev) and has_struct_arg_count_bug(dev) == "apple": dtypes.extend([ (np.float32, np.float64), ]) elif has_double_support(dev): dtypes.extend([ (np.float32, np.float64), (np.float32, np.complex128), (np.float64, np.complex64), (np.float64, np.complex128), ]) from operator import add, mul, sub, truediv for op in [add, sub, mul, truediv, pow]: for dtype_a0, dtype_b0 in dtypes: for dtype_a, dtype_b in [ (dtype_a0, dtype_b0), (dtype_b0, dtype_a0), ]: for is_scalar_a, is_scalar_b in [ (False, False), (False, True), (True, False), ]: if is_scalar_a: ary_a = make_random_array(queue, dtype_a, 1).get()[0] host_ary_a = ary_a else: ary_a = make_random_array(queue, dtype_a, size) host_ary_a = ary_a.get() if is_scalar_b: ary_b = make_random_array(queue, dtype_b, 1).get()[0] host_ary_b = ary_b else: ary_b = make_random_array(queue, dtype_b, size) host_ary_b = ary_b.get() print(op, dtype_a, dtype_b, is_scalar_a, is_scalar_b) dev_result = op(ary_a, ary_b).get() host_result = op(host_ary_a, host_ary_b) if host_result.dtype != dev_result.dtype: # This appears to be a numpy bug, where we get # served a Python complex that is really a # smaller numpy complex. 
print("HOST_DTYPE: {} DEV_DTYPE: {}".format( host_result.dtype, dev_result.dtype)) dev_result = dev_result.astype(host_result.dtype) err = la.norm(host_result-dev_result)/la.norm(host_result) print(err) correct = err < 1e-4 if not correct: print(host_result) print(dev_result) print(host_result - dev_result) assert correct # }}} # {{{ test_pow_neg1_vs_inv def test_pow_neg1_vs_inv(ctx_factory): ctx = ctx_factory() queue = cl.CommandQueue(ctx) device = ctx.devices[0] if not has_double_support(device): from pytest import skip skip("double precision not supported on %s" % device) if has_struct_arg_count_bug(device) == "apple": from pytest import xfail xfail("apple struct arg counting broken") a_dev = make_random_array(queue, np.complex128, 20000) res1 = (a_dev ** (-1)).get() res2 = (1/a_dev).get() ref = 1/a_dev.get() assert la.norm(res1-ref, np.inf) / la.norm(ref) < 1e-13 assert la.norm(res2-ref, np.inf) / la.norm(ref) < 1e-13 # }}} # {{{ test_vector_fill def test_vector_fill(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a_gpu = cl_array.Array(queue, 100, dtype=cltypes.float4) a_gpu.fill(cltypes.make_float4(0.0, 0.0, 1.0, 0.0)) a = a_gpu.get() assert a.dtype == cltypes.float4 a_gpu = cl_array.zeros(queue, 100, dtype=cltypes.float4) # }}} # {{{ test_zeros_large_array def test_zeros_large_array(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) dev = queue.device if dev.platform.vendor == "Intel(R) Corporation" \ and platform.system() == "Windows": pytest.xfail("large array fail with out-of-host memory with" "Intel CPU runtime as of 2022-10-05") size = 2**28 + 1 if dev.address_bits == 64 and dev.max_mem_alloc_size >= 8 * size: # this shouldn't hang/cause errors # see https://github.com/inducer/pyopencl/issues/395 a_gpu = cl_array.zeros(queue, (size,), dtype="float64") # run a couple kernels to ensure no propagated runtime errors a_gpu[...] = 1. 
a_gpu = 2 * a_gpu - 3 else: pass # }}} # {{{ test_absrealimag def test_absrealimag(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) def real(x): return x.real def imag(x): return x.imag def conj(x): return x.conj() n = 111 for func in [abs, real, imag, conj]: for dtype in [np.int32, np.float32, np.complex64]: print(func, dtype) a = -make_random_array(queue, dtype, n) host_res = func(a.get()) dev_res = func(a).get() correct = np.allclose(dev_res, host_res) if not correct: print(dev_res) print(host_res) print(dev_res-host_res) assert correct # }}} # {{{ test_custom_type_zeros def test_custom_type_zeros(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) if not ( queue._get_cl_version() >= (1, 2) and cl.get_cl_header_version() >= (1, 2)): pytest.skip("CL1.2 not available") dtype = np.dtype([ ("cur_min", np.int32), ("cur_max", np.int32), ("pad", np.int32), ]) from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct name = "mmc_type" dtype, _c_decl = match_dtype_to_c_struct(queue.device, name, dtype) dtype = get_or_register_dtype(name, dtype) n = 1000 z_dev = cl.array.zeros(queue, n, dtype=dtype) z = z_dev.get() assert np.array_equal(np.zeros(n, dtype), z) # }}} # {{{ test_custom_type_fill def test_custom_type_fill(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.characterize import has_struct_arg_count_bug if has_struct_arg_count_bug(queue.device): pytest.skip("device has LLVM arg counting bug") dtype = np.dtype([ ("cur_min", np.int32), ("cur_max", np.int32), ("pad", np.int32), ]) from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct name = "mmc_type" dtype, _c_decl = match_dtype_to_c_struct(queue.device, name, dtype) dtype = get_or_register_dtype(name, dtype) n = 1000 z_dev = cl.array.empty(queue, n, dtype=dtype) z_dev.fill(np.zeros((), dtype)) z = z_dev.get() assert np.array_equal(np.zeros(n, dtype), z) # }}} # {{{ test_custom_type_take_put def test_custom_type_take_put(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) dtype = np.dtype([ ("cur_min", np.int32), ("cur_max", np.int32), ]) from pyopencl.tools import get_or_register_dtype, match_dtype_to_c_struct name = "tp_type" dtype, _c_decl = match_dtype_to_c_struct(queue.device, name, dtype) dtype = get_or_register_dtype(name, dtype) n = 100 z = np.empty(100, dtype) z["cur_min"] = np.arange(n) z["cur_max"] = np.arange(n)**2 z_dev = cl.array.to_device(queue, z) ind = cl.array.arange(queue, n, step=3, dtype=np.int32) z_ind_ref = z[ind.get()] z_ind = z_dev[ind] assert np.array_equal(z_ind.get(), z_ind_ref) # }}} # }}} # {{{ operators # {{{ test_div_type_matches_numpy @pytest.mark.parametrize("dtype", [np.int8, np.int32, np.int64, np.float32]) # FIXME Implement floordiv # @pytest.mark.parametrize("op", [operator.truediv, operator.floordiv]) @pytest.mark.parametrize("op", [operator.truediv]) def test_div_type_matches_numpy(ctx_factory, dtype, op): context = ctx_factory() queue = cl.CommandQueue(context) a = cl_array.arange(queue, 10, dtype=dtype) + 1 res = op(4*a, 3*a) a_np = a.get() res_np = op(4*a_np, 3*a_np) assert res_np.dtype == res.dtype assert np.allclose(res_np, res.get()) # }}} # {{{ test_rmul_yields_right_type def test_rmul_yields_right_type(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a = np.array([1, 2, 3, 4, 5]).astype(np.float32) a_gpu = cl_array.to_device(queue, a) two_a = 2*a_gpu assert isinstance(two_a, cl_array.Array) two_a = np.float32(2)*a_gpu assert isinstance(two_a, 
cl_array.Array)

# }}}


# {{{ test_pow_array

def test_pow_array(ctx_factory):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a = np.array([1, 2, 3, 4, 5]).astype(np.float32)
    a_gpu = cl_array.to_device(queue, a)

    result = pow(a_gpu, a_gpu).get()
    assert (np.abs(a ** a - result) < 3e-3).all()

    result = (a_gpu ** a_gpu).get()
    assert (np.abs(pow(a, a) - result) < 3e-3).all()

# }}}


# {{{ test_pow_number

def test_pow_number(ctx_factory):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32)
    a_gpu = cl_array.to_device(queue, a)

    result = pow(a_gpu, 2).get()
    assert (np.abs(a ** 2 - result) < 1e-3).all()

# }}}


# {{{ test_multiply

def test_multiply(ctx_factory):
    """Test the multiplication of an array with a scalar."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    for sz in [10, 50000]:
        for dtype, scalars in [
                (np.float32, [2]),
                (np.complex64, [2j]),
                ]:
            for scalar in scalars:
                a_gpu = make_random_array(queue, dtype, sz)
                a = a_gpu.get()
                a_mult = (scalar * a_gpu).get()

                assert (a * scalar == a_mult).all()

# }}}


# {{{ test_multiply_array

def test_multiply_array(ctx_factory):
    """Test the multiplication of two arrays."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32)
    a_gpu = cl_array.to_device(queue, a)
    b_gpu = cl_array.to_device(queue, a)

    a_squared = (b_gpu * a_gpu).get()
    assert (a * a == a_squared).all()

# }}}


# {{{ test_addition_array

def test_addition_array(ctx_factory):
    """Test the addition of two arrays."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32)
    a_gpu = cl_array.to_device(queue, a)
    a_added = (a_gpu + a_gpu).get()

    assert (a + a == a_added).all()

# }}}


# {{{ test_addition_scalar

def test_addition_scalar(ctx_factory):
    """Test the addition of an array and a scalar."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32)
    a_gpu = cl_array.to_device(queue, a)
    a_added = (7 + a_gpu).get()

    assert (7 + a == a_added).all()

# }}}


# {{{ test_subtract_array

@pytest.mark.parametrize(("dtype_a", "dtype_b"), [
    (np.float32, np.float32),
    (np.float32, np.int32),
    (np.int32, np.int32),
    (np.int64, np.int32),
    (np.int64, np.uint32),
    ])
def test_subtract_array(ctx_factory, dtype_a, dtype_b):
    """Test the subtraction of two arrays."""
    # test data
    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(dtype_a)
    b = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).astype(dtype_b)

    context = ctx_factory()
    queue = cl.CommandQueue(context)

    a_gpu = cl_array.to_device(queue, a)
    b_gpu = cl_array.to_device(queue, b)

    result = (a_gpu - b_gpu).get()
    assert (a - b == result).all()

    result = (b_gpu - a_gpu).get()
    assert (b - a == result).all()

# }}}


# {{{ test_subtract_scalar

def test_subtract_scalar(ctx_factory):
    """Test the subtraction of an array and a scalar."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    # test data
    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32)

    # convert a to a gpu object
    a_gpu = cl_array.to_device(queue, a)

    result = (a_gpu - 7).get()
    assert (a - 7 == result).all()

    result = (7 - a_gpu).get()
    assert (7 - a == result).all()

# }}}


# {{{ test_divide_scalar

def test_divide_scalar(ctx_factory):
    """Test the division of an array and a scalar."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    if queue.device.platform.name == "Apple":
        pytest.xfail("Apple CL compiler crashes on this.")
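    # Sidebar: the invariant the division tests in this group enforce,
    # shown on a single dtype pair (a sketch; queue as above) --
    # pyopencl.array follows numpy's result-type rules, so both the
    # values and the result dtype must agree:
    #
    #     a = np.array([10., 20., 30.], dtype=np.float32)
    #     a_dev = cl_array.to_device(queue, a)
    #     s = np.float32(4)
    #     assert (a_dev / s).dtype == (a / s).dtype
    #     assert np.allclose((a_dev / s).get(), a / s)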
    dtypes = (np.uint8, np.uint16, np.uint32,
              np.int8, np.int16, np.int32,
              np.float32, np.complex64)
    from pyopencl.characterize import has_double_support
    if has_double_support(queue.device):
        dtypes = (*dtypes, np.float64, np.complex128)

    from itertools import product

    for dtype_a, dtype_s in product(dtypes, repeat=2):
        a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).astype(dtype_a)
        s = dtype_s(40)
        a_gpu = cl_array.to_device(queue, a)

        b = a / s
        b_gpu = a_gpu / s
        assert (np.abs(b_gpu.get() - b) < 1e-3).all()
        assert b_gpu.dtype is b.dtype

        c = s / a
        c_gpu = s / a_gpu
        assert (np.abs(c_gpu.get() - c) < 1e-3).all()
        assert c_gpu.dtype is c.dtype

# }}}


# {{{ test_divide_array

def test_divide_array(ctx_factory):
    """Test the division of two arrays."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    dtypes = (np.float32, np.complex64)
    from pyopencl.characterize import has_double_support
    if has_double_support(queue.device):
        dtypes = (*dtypes, np.float64, np.complex128)

    from itertools import product

    for dtype_a, dtype_b in product(dtypes, repeat=2):
        a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).astype(dtype_a)
        b = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10]).astype(dtype_b)

        a_gpu = cl_array.to_device(queue, a)
        b_gpu = cl_array.to_device(queue, b)

        c = a / b
        c_gpu = (a_gpu / b_gpu)
        assert (np.abs(c_gpu.get() - c) < 1e-3).all()
        assert c_gpu.dtype is c.dtype

        d = b / a
        d_gpu = (b_gpu / a_gpu)
        assert (np.abs(d_gpu.get() - d) < 1e-3).all()
        assert d_gpu.dtype is d.dtype

# }}}


# {{{ test_divide_inplace_scalar

def test_divide_inplace_scalar(ctx_factory):
    """Test inplace division of an array by a scalar."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    if queue.device.platform.name == "Apple":
        pytest.xfail("Apple CL compiler crashes on this.")

    dtypes = (np.uint8, np.uint16, np.uint32,
              np.int8, np.int16, np.int32,
              np.float32, np.complex64)
    from pyopencl.characterize import has_double_support
    if has_double_support(queue.device):
        dtypes = (*dtypes, np.float64, np.complex128)

    from itertools import product

    for dtype_a, dtype_s in product(dtypes, repeat=2):
        a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).astype(dtype_a)
        s = dtype_s(40)
        a_gpu = cl_array.to_device(queue, a)

        # ensure the same behavior as inplace numpy.ndarray division
        try:
            a /= s
        except TypeError:
            with np.testing.assert_raises(TypeError):
                a_gpu /= s
        else:
            a_gpu /= s
            assert (np.abs(a_gpu.get() - a) < 1e-3).all()
            assert a_gpu.dtype is a.dtype

# }}}


# {{{ test_divide_inplace_array

def test_divide_inplace_array(ctx_factory):
    """Test inplace division of two arrays."""
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    dtypes = (np.uint8, np.uint16, np.uint32,
              np.int8, np.int16, np.int32,
              np.float32, np.complex64)
    from pyopencl.characterize import has_double_support
    if has_double_support(queue.device):
        dtypes = (*dtypes, np.float64, np.complex128)

    from itertools import product

    for dtype_a, dtype_b in product(dtypes, repeat=2):
        print(dtype_a, dtype_b)
        a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).astype(dtype_a)
        b = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10]).astype(dtype_b)

        a_gpu = cl_array.to_device(queue, a)
        b_gpu = cl_array.to_device(queue, b)

        # ensure the same behavior as inplace numpy.ndarray division
        try:
            a_gpu /= b_gpu
        except TypeError:
            # pass for now, as numpy casts differently for in-place and
            # out-place true_divide
            pass
            # with np.testing.assert_raises(TypeError):
            #     a /= b
        else:
            a /= b
            assert (np.abs(a_gpu.get() - a) < 1e-3).all()
            assert a_gpu.dtype is a.dtype

# }}}


# {{{
test_bitwise def test_bitwise(ctx_factory): if _PYPY: pytest.xfail("numpypy: missing bitwise ops") context = ctx_factory() queue = cl.CommandQueue(context) from itertools import product dtypes = [np.dtype(t) for t in (np.int64, np.int32, np.int16, np.int8)] from pyopencl.clrandom import rand as clrand for a_dtype, b_dtype in product(dtypes, dtypes): ary_len = 16 int32_min = np.iinfo(np.int32).min int32_max = np.iinfo(np.int32).max a_dev = clrand( queue, (ary_len,), a=int32_min, b=1+int32_max, dtype=np.int64 ).astype(a_dtype) b_dev = clrand( queue, (ary_len,), a=int32_min, b=1+int32_max, dtype=np.int64 ).astype(b_dtype) a = a_dev.get() b = b_dev.get() s = int(clrand(queue, (), a=int32_min, b=1+int32_max, dtype=np.int64) .astype(b_dtype).get()) import operator as o for op in [o.and_, o.or_, o.xor]: res_dev = op(a_dev, b_dev) res = op(a, b) assert (res_dev.get() == res).all() try: res = op(a, s) except OverflowError: pass else: res_dev = op(a_dev, s) assert (res_dev.get() == res).all() try: res = op(s, b) except OverflowError: pass else: res_dev = op(s, b_dev) assert (res_dev.get() == res).all() for op in [o.iand, o.ior, o.ixor]: res_dev = a_dev.copy() op_res = op(res_dev, b_dev) assert op_res is res_dev res = a.copy() try: op(res, b) except OverflowError: pass else: assert (res_dev.get() == res).all() res = a.copy() try: op(res, s) except OverflowError: pass else: res_dev = a_dev.copy() op_res = op(res_dev, s) assert op_res is res_dev assert (res_dev.get() == res).all() # Test unary ~ res_dev = ~a_dev res = ~a # pylint:disable=invalid-unary-operand-type assert (res_dev.get() == res).all() # }}} # }}} # {{{ RNG # {{{ test_random_float_in_range @pytest.mark.parametrize("rng_class", [PhiloxGenerator, ThreefryGenerator]) @pytest.mark.parametrize("ary_size", [300, 301, 302, 303, 10007, 1000000]) def test_random_float_in_range(ctx_factory, rng_class, ary_size, plot_hist=False): context = ctx_factory() queue = cl.CommandQueue(context) if has_double_support(context.devices[0]): dtypes = [np.float32, np.float64] else: dtypes = [np.float32] gen = rng_class(context) for dtype in dtypes: print(dtype) ran = cl_array.zeros(queue, ary_size, dtype) gen.fill_uniform(ran) if plot_hist: import matplotlib.pyplot as pt pt.hist(ran.get(), 30) pt.show() assert (0 <= ran.get()).all() assert (ran.get() <= 1).all() ran = cl_array.zeros(queue, ary_size, dtype) gen.fill_uniform(ran, a=4, b=7) ran_host = ran.get() for cond in [4 <= ran_host, ran_host <= 7]: good = cond.all() if not good: print(np.where(~cond)) print(ran_host[~cond]) assert good ran = gen.normal(queue, ary_size, dtype, mu=10, sigma=3) if plot_hist: import matplotlib.pyplot as pt pt.hist(ran.get(), 30) pt.show() # }}} # {{{ test_random_int_in_range @pytest.mark.parametrize("dtype", [np.int32, np.int64]) @pytest.mark.parametrize("rng_class", [PhiloxGenerator, ThreefryGenerator]) def test_random_int_in_range(ctx_factory, rng_class, dtype, plot_hist=False): context = ctx_factory() queue = cl.CommandQueue(context) gen = rng_class(context) # if (dtype == np.int64 # and context.devices[0].platform.vendor.startswith("Advanced Micro")): # pytest.xfail("AMD miscompiles 64-bit RNG math") ran = gen.uniform(queue, (10000007,), dtype, a=200, b=300).get() assert (200 <= ran).all() assert (ran < 300).all() print(np.min(ran), np.max(ran)) assert np.max(ran) > 295 if plot_hist: from matplotlib import pyplot as pt pt.hist(ran) pt.show() # }}} # }}} # {{{ misc # {{{ test_numpy_integer_shape def test_numpy_integer_shape(ctx_factory): try: list(np.int32(17)) except Exception: 
pass else: from pytest import skip skip("numpy implementation does not handle scalar correctly.") context = ctx_factory() queue = cl.CommandQueue(context) cl_array.empty(queue, np.int32(17), np.float32) cl_array.empty(queue, (np.int32(17), np.int32(17)), np.float32) # }}} # {{{ test_allocation_with_various_shape_scalar_types def test_allocation_with_various_shape_scalar_types(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) dims_ok = (2, np.int32(7), np.uint64(1)) dims_not_ok = (-1, 5.70, np.float32(7)) shapes_ok_1d = list(product(dims_ok)) shapes_ok_2d = list(product(dims_ok, dims_ok)) shapes_ok_3d = list(product(dims_ok, dims_ok, dims_ok)) shapes_not_ok_1d = list(product(dims_not_ok)) shapes_not_ok_2d = list(product(dims_ok, dims_not_ok)) shapes_not_ok_3d = list(product(dims_not_ok, dims_not_ok, dims_not_ok)) for shape in shapes_ok_1d + shapes_ok_2d + shapes_ok_3d: cl_array.empty(queue, shape, np.float32) for shape in shapes_not_ok_1d + shapes_not_ok_2d + shapes_not_ok_3d: with pytest.raises(ValueError): cl_array.empty(queue, shape, np.float32) # }}} # {{{ test_len def test_len(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).astype(np.float32) a_cpu = cl_array.to_device(queue, a) assert len(a_cpu) == 10 # }}} # {{{ test_stride_preservation def test_stride_preservation(ctx_factory): if _PYPY: pytest.xfail("numpypy: no array creation from __array_interface__") context = ctx_factory() queue = cl.CommandQueue(context) rng = np.random.default_rng(seed=42) a = rng.random(size=(3, 3)) at = a.T print(at.flags.f_contiguous, at.flags.c_contiguous) at_gpu = cl_array.to_device(queue, at) print(at_gpu.flags.f_contiguous, at_gpu.flags.c_contiguous) assert np.allclose(at_gpu.get(), at) # }}} # {{{ test_nan_arithmetic def test_nan_arithmetic(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) rng = np.random.default_rng(seed=42) def make_nan_contaminated_vector(size): a = rng.standard_normal(size=(size,), dtype=np.float32) from random import randrange for _i in range(size // 10): a[randrange(0, size)] = float("nan") return a size = 1 << 20 a = make_nan_contaminated_vector(size) a_gpu = cl_array.to_device(queue, a) b = make_nan_contaminated_vector(size) b_gpu = cl_array.to_device(queue, b) ab = a * b ab_gpu = (a_gpu * b_gpu).get() assert (np.isnan(ab) == np.isnan(ab_gpu)).all() # }}} # {{{ test_mem_pool_with_arrays def test_mem_pool_with_arrays(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) mem_pool = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue)) a_dev = cl_array.arange(queue, 2000, dtype=np.float32, allocator=mem_pool) b_dev = cl_array.to_device(queue, np.arange(2000), allocator=mem_pool) + 4000 assert a_dev.allocator is mem_pool assert b_dev.allocator is mem_pool # }}} # {{{ test_view def test_view(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a = np.arange(128).reshape(8, 16).astype(np.float32) a_dev = cl_array.to_device(queue, a) # same dtype view = a_dev.view() assert view.shape == a_dev.shape and view.dtype == a_dev.dtype # larger dtype view = a_dev.view(np.complex64) assert view.shape == (8, 8) and view.dtype == np.complex64 # smaller dtype view = a_dev.view(np.int16) assert view.shape == (8, 32) and view.dtype == np.int16 # }}} # {{{ test_diff def test_diff(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand ary_len = 20000 a_dev = clrand(queue, (ary_len,), 
dtype=np.float32) a = a_dev.get() err = la.norm( cl.array.diff(a_dev).get() - np.diff(a)) assert err < 1e-4 # }}} # {{{ test_copy def test_copy(ctx_factory): context = ctx_factory() queue1 = cl.CommandQueue(context) queue2 = cl.CommandQueue(context) # Test copy arr = cl.array.zeros(queue1, 100, np.int32) arr_copy = arr.copy() assert (arr == arr_copy).all().get() assert arr.data != arr_copy.data assert arr_copy.queue is queue1 # Test queue association arr_copy = arr.copy(queue=queue2) assert arr_copy.queue is queue2 arr_copy = arr.copy(queue=None) assert arr_copy.queue is None arr_copy = arr.with_queue(None).copy(queue=queue1) assert arr_copy.queue is queue1 # }}} # }}} # {{{ slices, concatenation # {{{ test_slice def test_slice(ctx_factory): if _PYPY: pytest.xfail("numpypy: spurious as_strided failure") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand tp = np.float32 ary_len = 20000 a_gpu = clrand(queue, (ary_len,), dtype=tp) b_gpu = clrand(queue, (ary_len,), dtype=tp) a = a_gpu.get() b = b_gpu.get() start_offset = 0 if queue.device.platform.name == "Intel(R) OpenCL": pytest.skip("Intel CL regularly crashes on this test case " "-- https://github.com/conda-forge/" "intel-compiler-repack-feedstock/issues/7") from random import randrange for _i in range(20): start = randrange(ary_len - start_offset) end = randrange(start+start_offset, ary_len) a_gpu_slice = tp(2)*a_gpu[start:end] a_slice = tp(2)*a[start:end] assert la.norm(a_gpu_slice.get() - a_slice) == 0 for _i in range(20): start = randrange(ary_len-start_offset) # end = randrange(start+start_offset, ary_len) end = start a_gpu[start:end] = tp(2)*b[start:end] a[start:end] = tp(2)*b[start:end] assert la.norm(a_gpu.get() - a) == 0 for _i in range(20): start = randrange(ary_len-start_offset) end = randrange(start+start_offset, ary_len) a_gpu[start:end] = tp(2)*b_gpu[start:end] a[start:end] = tp(2)*b[start:end] assert la.norm(a_gpu.get() - a) == 0 # }}} # {{{ test_concatenate def test_concatenate(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand a_dev = clrand(queue, (5, 15, 20), dtype=np.float32) b_dev = clrand(queue, (4, 15, 20), dtype=np.float32) c_dev = clrand(queue, (3, 15, 20), dtype=np.float32) a = a_dev.get() b = b_dev.get() c = c_dev.get() cat_dev = cl.array.concatenate((a_dev, b_dev, c_dev)) cat = np.concatenate((a, b, c)) assert la.norm(cat - cat_dev.get()) == 0 # }}} # }}} # {{{ conditionals, any, all # {{{ test_comparisons def test_comparisons(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand ary_len = 20000 a_dev = clrand(queue, (ary_len,), dtype=np.float32) b_dev = clrand(queue, (ary_len,), dtype=np.float32) a = a_dev.get() b = b_dev.get() import operator as o for op in [o.eq, o.ne, o.le, o.lt, o.ge, o.gt]: res_dev = op(a_dev, b_dev) res = op(a, b) assert (res_dev.get() == res).all() res_dev = op(a_dev, 0) res = op(a, 0) assert (res_dev.get() == res).all() res_dev = op(0, b_dev) res = op(0, b) assert (res_dev.get() == res).all() res2_dev = op(0, res_dev) res2 = op(0, res) assert (res2_dev.get() == res2).all() # }}} # {{{ test_any_all def test_any_all(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) ary_len = 20000 a_dev = cl_array.zeros(queue, (ary_len,), dtype=np.int8) assert not a_dev.all().get() assert not a_dev.any().get() a_dev[15213] = 1 assert not a_dev.all().get() assert a_dev.any().get() a_dev.fill(1) assert 
a_dev.all().get() assert a_dev.any().get() # }}} # }}} # {{{ test_map_to_host def test_map_to_host(ctx_factory): if _PYPY: pytest.skip("numpypy: no array creation from __array_interface__") context = ctx_factory() queue = cl.CommandQueue(context) if context.devices[0].type & cl.device_type.GPU: mf = cl.mem_flags allocator = cl_tools.DeferredAllocator( context, mf.READ_WRITE | mf.ALLOC_HOST_PTR) else: allocator = None a_dev = cl_array.zeros(queue, (5, 6, 7,), dtype=np.float32, allocator=allocator) a_dev[3, 2, 1] = 10 a_host = a_dev.map_to_host() a_host[1, 2, 3] = 10 a_host_saved = a_host.copy() a_host.base.release(queue) a_dev.finish() print("DEV[HOST_WRITE]", a_dev.get()[1, 2, 3]) print("HOST[DEV_WRITE]", a_host_saved[3, 2, 1]) assert (a_host_saved == a_dev.get()).all() # }}} # {{{ test_view_and_strides def test_view_and_strides(ctx_factory): if _PYPY: pytest.xfail("numpypy: no array creation from __array_interface__") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand x = clrand(queue, (5, 10), dtype=np.float32) y = x[:3, :5] yv = y.view() assert yv.shape == y.shape assert yv.strides == y.strides with pytest.raises(AssertionError): assert (yv.get() == x.get()[:3, :5]).all() # }}} # {{{ test_meshmode_view def test_meshmode_view(ctx_factory): if _PYPY: # https://bitbucket.org/pypy/numpy/issue/28/indexerror-on-ellipsis-slice pytest.xfail("numpypy bug #28") context = ctx_factory() queue = cl.CommandQueue(context) n = 2 result = cl.array.empty(queue, (2, n*6), np.float32) def view(z): return z[..., n*3:n*6].reshape(z.shape[:-1] + (n, 3)) result = result.with_queue(queue) result.fill(0) view(result)[0].fill(1) view(result)[1].fill(1) x = result.get() assert (view(x) == 1).all() # }}} # {{{ test_event_management def test_event_management(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand x = clrand(queue, (5, 10), dtype=np.float32) assert len(x.events) == 1, len(x.events) x.finish() assert len(x.events) == 0 y = x+x assert len(y.events) == 1 y = x*x assert len(y.events) == 1 y = 2*x assert len(y.events) == 1 y = 2/x assert len(y.events) == 1 y = x/2 assert len(y.events) == 1 y = x**2 assert len(y.events) == 1 y = 2**x assert len(y.events) == 1 for _i in range(10): x.fill(0) assert len(x.events) == 10 for _i in range(1000): x.fill(0) assert len(x.events) < 100 # }}} # {{{ test_reshape def test_reshape(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a = np.arange(128).reshape(8, 16).astype(np.float32) a_dev = cl_array.to_device(queue, a) # different ways to specify the shape a_dev.reshape(4, 32) a_dev.reshape((4, 32)) a_dev.reshape([4, 32]) # using -1 as unknown dimension assert a_dev.reshape(-1, 32).shape == (4, 32) assert a_dev.reshape((32, -1)).shape == (32, 4) assert a_dev.reshape((8, -1, 4)).shape == (8, 4, 4) import pytest with pytest.raises(ValueError): a_dev.reshape(-1, -1, 4) # }}} # {{{ test_skip_slicing def test_skip_slicing(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a_host = np.arange(16).reshape((4, 4)) b_host = a_host[::3] a = cl_array.to_device(queue, a_host) b = a[::3] assert b.shape == b_host.shape # pylint:disable=unsubscriptable-object assert np.array_equal(b[1].get(), b_host[1]) # }}} # {{{ test_transpose def test_transpose(ctx_factory): if _PYPY: pytest.xfail("numpypy: no array creation from __array_interface__") context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand 
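# Sidebar: the event bookkeeping that test_event_management above
# verifies, in isolation (a sketch; queue assumed to exist). Each
# operation appends a cl.Event to ary.events; finish() waits for and
# clears them, and the list is pruned automatically once it grows past
# an internal bound (the test checks it stays below 100 after 1000
# fills):
#
#     x = cl_array.zeros(queue, 10, np.float32)
#     y = x + x
#     assert len(y.events) == 1   # the kernel launch that computed y
#     y.finish()
#     assert not y.events         # finish() waited and cleared the list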
a_gpu = clrand(queue, (10, 20, 30), dtype=np.float32) a = a_gpu.get() # FIXME: not contiguous # assert np.allclose(a_gpu.transpose((1,2,0)).get(), a.transpose((1,2,0))) assert np.array_equal(a_gpu.T.get(), a.T) # }}} # {{{ test_newaxis def test_newaxis(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) from pyopencl.clrandom import rand as clrand a_gpu = clrand(queue, (10, 20, 30), dtype=np.float32) a = a_gpu.get() b_gpu = a_gpu[:, np.newaxis] b = a[:, np.newaxis] assert b_gpu.shape == b.shape for i in range(b.ndim): if b.shape[i] > 1: assert b_gpu.strides[i] == b.strides[i] # }}} # {{{ test_squeeze def test_squeeze(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) rng = np.random.default_rng(seed=42) shape = (40, 2, 5, 100) a_cpu = rng.random(size=shape) a_gpu = cl_array.to_device(queue, a_cpu) # Slice with length 1 on dimensions 0 and 1 a_gpu_slice = a_gpu[0:1, 1:2, :, :] assert a_gpu_slice.shape == (1, 1, shape[2], shape[3]) assert a_gpu_slice.flags.c_contiguous # Squeeze it and obtain contiguity a_gpu_squeezed_slice = a_gpu[0:1, 1:2, :, :].squeeze() assert a_gpu_squeezed_slice.shape == (shape[2], shape[3]) assert a_gpu_squeezed_slice.flags.c_contiguous # Check that we get the original values out # assert np.all(a_gpu_slice.get().ravel() == a_gpu_squeezed_slice.get().ravel()) # Slice with length 1 on dimensions 2 a_gpu_slice = a_gpu[:, :, 2:3, :] assert a_gpu_slice.shape == (shape[0], shape[1], 1, shape[3]) assert not a_gpu_slice.flags.c_contiguous # Squeeze it, but no contiguity here a_gpu_squeezed_slice = a_gpu[:, :, 2:3, :].squeeze() assert a_gpu_squeezed_slice.shape == (shape[0], shape[1], shape[3]) assert not a_gpu_squeezed_slice.flags.c_contiguous # Check that we get the original values out # assert np.all(a_gpu_slice.get().ravel() == a_gpu_squeezed_slice.get().ravel()) # }}} # {{{ test_fancy_fill def test_fancy_fill(ctx_factory): if _PYPY: pytest.xfail("numpypy: multi value setting is not supported") context = ctx_factory() queue = cl.CommandQueue(context) numpy_dest = np.zeros((4,), np.int32) numpy_idx = np.arange(3, dtype=np.int32) numpy_src = np.arange(8, 9, dtype=np.int32) numpy_dest[numpy_idx] = numpy_src cl_dest = cl_array.zeros(queue, (4,), np.int32) cl_idx = cl_array.arange(queue, 3, dtype=np.int32) cl_src = cl_array.arange(queue, 8, 9, dtype=np.int32) cl_dest[cl_idx] = cl_src assert np.all(numpy_dest == cl_dest.get()) # }}} # {{{ test_fancy_indexing def test_fancy_indexing(ctx_factory): if _PYPY: pytest.xfail("numpypy: multi value setting is not supported") context = ctx_factory() queue = cl.CommandQueue(context) rng = np.random.default_rng(seed=42) n = 2 ** 20 + 2**18 + 22 numpy_dest = np.zeros(n, dtype=np.int32) numpy_idx = np.arange(n, dtype=np.int32) rng.shuffle(numpy_idx) numpy_src = 20000+np.arange(n, dtype=np.int32) cl_dest = cl_array.to_device(queue, numpy_dest) cl_idx = cl_array.to_device(queue, numpy_idx) cl_src = cl_array.to_device(queue, numpy_src) numpy_dest[numpy_idx] = numpy_src cl_dest[cl_idx] = cl_src assert np.array_equal(numpy_dest, cl_dest.get()) numpy_dest = numpy_src[numpy_idx] cl_dest = cl_src[cl_idx] assert np.array_equal(numpy_dest, cl_dest.get()) # }}} # {{{ test_multi_put def test_multi_put(ctx_factory): if _PYPY: pytest.xfail("numpypy: multi value setting is not supported") context = ctx_factory() queue = cl.CommandQueue(context) cl_arrays = [ cl_array.arange(queue, 0, 3, dtype=np.float32) for i in range(1, 10) ] idx = cl_array.arange(queue, 0, 6, dtype=np.int32) out_arrays = [ cl_array.zeros(queue, 
(10,), np.float32)
        for i in range(9)
    ]

    out_compare = [np.zeros((10,), np.float32) for i in range(9)]
    for _i, ary in enumerate(out_compare):
        ary[idx.get()] = np.arange(0, 6, dtype=np.float32)

    cl_array.multi_put(cl_arrays, idx, out=out_arrays)

    # use the builtin all() here: np.all() over a generator expression would
    # treat it as a single truthy object and never evaluate the comparisons
    assert all(
            np.all(out_compare[i] == out_arrays[i].get())
            for i in range(9))

# }}}


# {{{ test_get_async

def test_get_async(ctx_factory):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    device = queue.device
    if device.platform.vendor == "The pocl project" \
            and device.type & cl.device_type.GPU:
        pytest.xfail("the async get test fails on PoCL + Nvidia, "
                "at least the K40, as of PoCL 1.6, 2021-01-20")

    rng = np.random.default_rng(seed=42)
    a = rng.random(10**6, dtype=np.float32)
    a_gpu = cl_array.to_device(queue, a)
    b = a + a**5 + 1
    b_gpu = a_gpu + a_gpu**5 + 1

    # deprecated, but still test
    with pytest.deprecated_call():
        b1 = b_gpu.get(async_=True)  # testing that this waits for events
    b_gpu.finish()
    assert np.abs(b1 - b).mean() < 1e-5

    b1, evt = b_gpu.get_async()  # testing that this waits for events
    evt.wait()
    assert np.abs(b1 - b).mean() < 1e-5

    wait_event = cl.UserEvent(context)
    b_gpu.add_event(wait_event)
    b, evt = b_gpu.get_async()  # testing that this doesn't hang
    wait_event.set_status(cl.command_execution_status.COMPLETE)
    evt.wait()
    assert np.abs(b1 - b).mean() < 1e-5

# }}}


# {{{ test_outoforderqueue_get

def test_outoforderqueue_get(ctx_factory):
    context = ctx_factory()
    try:
        queue = cl.CommandQueue(context,
               properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    except Exception:
        pytest.skip("out-of-order queue not available")
    rng = np.random.default_rng(seed=42)
    a = rng.random(10**6, dtype=np.float32)
    a_gpu = cl_array.to_device(queue, a)
    b_gpu = a_gpu + a_gpu**5 + 1
    b1 = b_gpu.get()  # testing that this waits for events
    b = a + a**5 + 1
    assert np.abs(b1 - b).mean() < 1e-5

# }}}


# {{{ test_outoforderqueue_copy

def test_outoforderqueue_copy(ctx_factory):
    context = ctx_factory()
    try:
        queue = cl.CommandQueue(context,
               properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    except Exception:
        pytest.skip("out-of-order queue not available")
    rng = np.random.default_rng(seed=42)
    a = rng.random(10**6, dtype=np.float32)
    a_gpu = cl_array.to_device(queue, a)
    c_gpu = a_gpu**2 - 7
    b_gpu = c_gpu.copy()  # testing that this waits for and creates events
    b_gpu *= 10
    queue.finish()
    b1 = b_gpu.get()
    b = 10 * (a**2 - 7)
    assert np.abs(b1 - b).mean() < 1e-5

# }}}


# {{{ test_outoforderqueue_indexing

def test_outoforderqueue_indexing(ctx_factory):
    context = ctx_factory()
    try:
        queue = cl.CommandQueue(context,
               properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    except Exception:
        pytest.skip("out-of-order queue not available")
    rng = np.random.default_rng(seed=42)
    a = rng.random(10**6, dtype=np.float32)
    i = (8e5 + 1e5 * rng.random(10**5)).astype(np.int32)
    a_gpu = cl_array.to_device(queue, a)
    i_gpu = cl_array.to_device(queue, i)
    c_gpu = (a_gpu**2)[i_gpu - 10000]
    b_gpu = 10 - a_gpu
    b_gpu[:] = 8 * a_gpu
    b_gpu[i_gpu + 10000] = c_gpu - 10
    queue.finish()
    b1 = b_gpu.get()
    c = (a**2)[i - 10000]
    b = 8 * a
    b[i + 10000] = c - 10
    assert np.abs(b1 - b).mean() < 1e-5

# }}}


# {{{ test_outoforderqueue_reductions

def test_outoforderqueue_reductions(ctx_factory):
    context = ctx_factory()
    try:
        queue = cl.CommandQueue(context,
               properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    except Exception:
        pytest.skip("out-of-order queue not available")
    # 0/1 values to avoid accumulated rounding error
    rng = np.random.default_rng(seed=42)
    a =
(rng.random(10**6) > 0.5).astype(np.float32) a[800000] = 10 # all<5 looks true until near the end a_gpu = cl_array.to_device(queue, a) b1 = cl_array.sum(a_gpu).get() b2 = cl_array.dot(a_gpu, 3 - a_gpu).get() b3 = (a_gpu < 5).all().get() assert b1 == a.sum() and b2 == a.dot(3 - a) and b3 == 0 # }}} # {{{ test_negative_dim_rejection def test_negative_dim_rejection(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) with pytest.raises(ValueError): cl_array.Array(queue, shape=-10, dtype=np.float64) with pytest.raises(ValueError): cl_array.Array(queue, shape=(-10,), dtype=np.float64) for left_dim in (-1, 0, 1): with pytest.raises(ValueError): cl_array.Array(queue, shape=(left_dim, -1), dtype=np.float64) for right_dim in (-1, 0, 1): with pytest.raises(ValueError): cl_array.Array(queue, shape=(-1, right_dim), dtype=np.float64) # }}} # {{{ test_zero_size_array @pytest.mark.parametrize("empty_shape", [0, (), (3, 0, 2), (0, 5), (5, 0)]) def test_zero_size_array(ctx_factory, empty_shape): context = ctx_factory() queue = cl.CommandQueue(context) if queue.device.platform.name == "Intel(R) OpenCL": pytest.xfail("size-0 arrays fail on Intel CL") a = cl_array.zeros(queue, empty_shape, dtype=np.float32) b = cl_array.zeros(queue, empty_shape, dtype=np.float32) b.fill(1) c = a + b c_host = c.get() cl_array.to_device(queue, c_host) assert c.flags.c_contiguous == c_host.flags.c_contiguous assert c.flags.f_contiguous == c_host.flags.f_contiguous for order in "CF": c_flat = c.reshape(-1, order=order) c_host_flat = c_host.reshape(-1, order=order) assert c_flat.shape == c_host_flat.shape assert c_flat.strides == c_host_flat.strides assert c_flat.flags.c_contiguous == c_host_flat.flags.c_contiguous assert c_flat.flags.f_contiguous == c_host_flat.flags.f_contiguous # }}} # {{{ test_str_without_queue def test_str_without_queue(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) a = cl_array.zeros(queue, 10, dtype=np.float32).with_queue(None) print(str(a)) print(repr(a)) # }}} # {{{ test_stack @pytest.mark.parametrize("order", ("F", "C")) @pytest.mark.parametrize("input_dims", (1, 2, 3)) def test_stack(ctx_factory, input_dims, order): # Replicates pytato/test/test_codegen.py::test_stack import pyopencl.array as cla cl_ctx = ctx_factory() queue = cl.CommandQueue(cl_ctx) shape = (2, 2, 2)[:input_dims] axis = -1 if order == "F" else 0 rng = np.random.default_rng(seed=42) x_in = rng.random(size=shape) y_in = rng.random(size=shape) x_in = x_in if order == "C" else np.asfortranarray(x_in) y_in = y_in if order == "C" else np.asfortranarray(y_in) x = cla.to_device(queue, x_in) y = cla.to_device(queue, y_in) np.testing.assert_allclose(cla.stack((x, y), axis=axis).get(), np.stack((x_in, y_in), axis=axis)) # }}} # {{{ test_assign_different_strides def test_assign_different_strides(ctx_factory): cl_ctx = ctx_factory() queue = cl.CommandQueue(cl_ctx) from pyopencl.clrandom import rand as clrand a = clrand(queue, (20, 30), dtype=np.float32) b = cl_array.empty(queue, (20, 30), dtype=np.float32, order="F") with pytest.raises(NotImplementedError): b[:] = a # }}} # {{{ test_branch_operations_on_pure_scalars def test_branch_operations_on_pure_scalars(): rng = np.random.default_rng(seed=42) x = rng.random() y = rng.random() cond = rng.choice([False, True]) np.testing.assert_allclose(np.maximum(x, y), cl_array.maximum(x, y)) np.testing.assert_allclose(np.minimum(x, y), cl_array.minimum(x, y)) np.testing.assert_allclose(np.where(cond, x, y), cl_array.if_positive(cond, x, y)) # }}} # {{{ 
test_branch_operations_on_nans @pytest.mark.parametrize("op", [ cl_array.maximum, cl_array.minimum, ]) @pytest.mark.parametrize("special_a", [ np.nan, np.inf, -np.inf, ]) @pytest.mark.parametrize("special_b", [ np.nan, np.inf, -np.inf, None ]) def test_branch_operations_on_nans(ctx_factory, op, special_a, special_b): ctx = ctx_factory() cq = cl.CommandQueue(ctx) def sb_or(x): if special_b is None: return x else: return special_b x_np = np.array([special_a, sb_or(1.), special_a, sb_or(2.), sb_or(3.)], dtype=np.float64) y_np = np.array([special_a, special_a, sb_or(1.), sb_or(3.), sb_or(2.)], dtype=np.float64) x_cl = cl_array.to_device(cq, x_np) y_cl = cl_array.to_device(cq, y_np) ref = getattr(np, op.__name__)(x_np, y_np) result = op(x_cl, y_cl) if isinstance(result, cl_array.Array): result = result.get() np.testing.assert_allclose(result, ref) # }}} # {{{ test_slice_copy def test_slice_copy(ctx_factory): cl_ctx = ctx_factory() queue = cl.CommandQueue(cl_ctx) rng = np.random.default_rng(seed=42) x = cl.array.to_device(queue, rng.random(size=(96, 27))) y = x[::8, ::3] with pytest.raises(RuntimeError): y.copy() # }}} # {{{{ test_ravel @pytest.mark.parametrize("order", ("C", "F")) def test_ravel(ctx_factory, order): ctx = ctx_factory() cq = cl.CommandQueue(ctx) rng = np.random.default_rng(seed=42) x = rng.standard_normal(size=(10, 4)) if order == "F": x = np.asfortranarray(x) elif order == "C": pass else: raise AssertionError x_cl = cl.array.to_device(cq, x) np.testing.assert_allclose(x_cl.ravel(order=order).get(), x.ravel(order=order)) # }}} # {{{ test_arithmetic_on_non_scalars def test_arithmetic_on_non_scalars(ctx_factory): pytest.importorskip("dataclasses") from dataclasses import dataclass ctx = ctx_factory() cq = cl.CommandQueue(ctx) @dataclass class ArrayContainer: _data: np.ndarray def __eq__(self, other): return ArrayContainer(self._data == other) with pytest.raises(TypeError): ArrayContainer(np.ones(100)) + cl.array.zeros(cq, (10,), dtype=np.float64) # }}} # {{{ test_arithmetic_with_device_scalars @pytest.mark.parametrize("which", ("add", "sub", "mul", "truediv")) def test_arithmetic_with_device_scalars(ctx_factory, which): import operator ctx = ctx_factory() cq = cl.CommandQueue(ctx) rng = np.random.default_rng(seed=42) ndim = rng.integers(1, 5) shape = tuple(rng.integers(2, 7) for i in range(ndim)) x_in = rng.random(shape) x_cl = cl_array.to_device(cq, x_in) idx = tuple(rng.integers(0, dim) for dim in shape) op = getattr(operator, which) res_cl = op(x_cl, x_cl[idx]) res_np = op(x_in, x_in[idx]) np.testing.assert_allclose(res_cl.get(), res_np) # }}} # {{{ test_if_positive_with_scalars @pytest.mark.parametrize("then_type", ["array", "host_scalar", "device_scalar"]) @pytest.mark.parametrize("else_type", ["array", "host_scalar", "device_scalar"]) def test_if_positive_with_scalars(ctx_factory, then_type, else_type): ctx = ctx_factory() cq = cl.CommandQueue(ctx) rng = np.random.default_rng(seed=42) shape = (512,) criterion_np = rng.random(shape) criterion_cl = cl_array.to_device(cq, criterion_np) def _get_array_or_scalar(rtype, value): if rtype == "array": ary_np = value + np.zeros(shape, dtype=criterion_cl.dtype) ary_cl = value + cl_array.zeros_like(criterion_cl) elif rtype == "host_scalar": ary_np = ary_cl = value elif rtype == "device_scalar": ary_np = value ary_cl = cl_array.to_device(cq, np.array(value)) else: raise ValueError(rtype) return ary_np, ary_cl then_np, then_cl = _get_array_or_scalar(then_type, 0.0) else_np, else_cl = _get_array_or_scalar(else_type, 1.0) result_cl = 
cl_array.if_positive(criterion_cl < 0.5, then_cl, else_cl) result_np = np.where(criterion_np < 0.5, then_np, else_np) np.testing.assert_allclose(result_cl.get(), result_np) # }}} # {{{ test_maximum_minimum_with_scalars def test_maximum_minimum_with_scalars(ctx_factory): ctx = ctx_factory() cq = cl.CommandQueue(ctx) a_np = np.float64(4.0) a_cl = cl_array.to_device(cq, np.array(a_np)).with_queue(None) b_np = np.float64(-3.0) b_cl = cl_array.to_device(cq, np.array(b_np)).with_queue(None) result = cl_array.maximum(a_np, b_cl, queue=cq) np.testing.assert_allclose(result.get(), a_np) result = cl_array.maximum(a_cl, b_np, queue=cq) np.testing.assert_allclose(result.get(), a_np) result = cl_array.maximum(a_cl, b_cl, queue=cq) np.testing.assert_allclose(result.get(), a_np) result = cl_array.minimum(a_np, b_cl, queue=cq) np.testing.assert_allclose(result.get(), b_np) result = cl_array.minimum(a_cl, b_np, queue=cq) np.testing.assert_allclose(result.get(), b_np) result = cl_array.minimum(a_cl, b_cl, queue=cq) np.testing.assert_allclose(result.get(), b_np) # Test 'untyped' scalars # FIXME: these don't work with unsized ints result = cl_array.minimum(4.0, b_cl, queue=cq) np.testing.assert_allclose(result.get(), b_np) result = cl_array.maximum(4.0, b_cl, queue=cq) np.testing.assert_allclose(result.get(), a_np) result = cl_array.minimum(b_cl, 4.0, queue=cq) np.testing.assert_allclose(result.get(), b_np) result = cl_array.maximum(b_cl, 4.0, queue=cq) np.testing.assert_allclose(result.get(), a_np) result = cl_array.minimum(-3.0, 4.0, queue=cq) np.testing.assert_allclose(result, b_np) result = cl_array.maximum(-3.0, 4.0, queue=cq) np.testing.assert_allclose(result, a_np) # }}} # {{{ test_empty_reductions_vs_numpy @pytest.mark.parametrize(("reduction", "supports_initial"), [ (cl_array.any, False), (cl_array.all, False), (cl_array.sum, True), (cl_array.max, True), (cl_array.min, True), ]) def test_empty_reductions_vs_numpy(ctx_factory, reduction, supports_initial): ctx = ctx_factory() cq = cl.CommandQueue(ctx) # {{{ empty x_np = np.array([], dtype=np.float64) x_cl = cl_array.to_device(cq, x_np) try: ref = getattr(np, reduction.__name__)(x_np) except ValueError: ref = None if ref is None: with pytest.raises(ValueError): reduction(x_cl) else: result = reduction(x_cl) if isinstance(result, cl_array.Array): result = result.get() np.testing.assert_allclose(result, ref) # }}} # {{{ empty with initial if supports_initial: ref = getattr(np, reduction.__name__)(x_np, initial=5.0) result = reduction(x_cl, initial=5.0) if isinstance(result, cl_array.Array): result = result.get() np.testing.assert_allclose(result, ref) # }}} # {{{ non-empty with initial if supports_initial: x_np = np.linspace(-1, 1, 10) x_cl = cl_array.to_device(cq, x_np) ref = getattr(np, reduction.__name__)(x_np, initial=5.0) result = reduction(x_cl, initial=5.0).get() np.testing.assert_allclose(result, ref) ref = getattr(np, reduction.__name__)(x_np, initial=-5.0) result = reduction(x_cl, initial=-5.0).get() np.testing.assert_allclose(result, ref) # }}} # }}} # {{{ test_reduction_nan_handling @pytest.mark.parametrize("with_initial", [False, True]) @pytest.mark.parametrize("input_case", ["only nans", "mixed"]) @pytest.mark.parametrize("reduction", [ cl_array.sum, cl_array.max, cl_array.min, ]) def test_reduction_nan_handling(ctx_factory, reduction, input_case, with_initial): ctx = ctx_factory() cq = cl.CommandQueue(ctx) if input_case == "only nans": x_np = np.array([np.nan, np.nan], dtype=np.float64) elif input_case == "mixed": x_np = np.array([np.nan, 
1.], dtype=np.float64) else: raise ValueError("invalid input case") x_cl = cl_array.to_device(cq, x_np) if with_initial: ref = getattr(np, reduction.__name__)(x_np, initial=5.0) result = reduction(x_cl, initial=5.0) else: ref = getattr(np, reduction.__name__)(x_np) result = reduction(x_cl) if isinstance(result, cl_array.Array): result = result.get() np.testing.assert_allclose(result, ref) # }}} # {{{ test_reductions_dtype def test_dtype_conversions(ctx_factory): ctx = ctx_factory() queue = cl.CommandQueue(ctx) ary = cl.array.to_device(queue, np.linspace(0, 1, 32)) for func, nargs, arg_name in [ (cl.array.sum, 1, "dtype"), (cl.array.dot, 2, "dtype"), (cl.array.vdot, 2, "dtype"), (cl.array.cumsum, 1, "output_dtype"), ]: for dtype in [np.float32, np.float64]: result = func(*((ary,) * nargs), **{arg_name: dtype}) assert result.dtype == dtype, result.dtype # }}} # {{{ test_svm_mem_pool_with_arrays @pytest.mark.parametrize("use_mempool", [False, True]) def test_arrays_with_svm_allocators(ctx_factory, use_mempool): context = ctx_factory() queue = cl.CommandQueue(context) queue2 = cl.CommandQueue(context) from pyopencl.characterize import has_coarse_grain_buffer_svm has_cg_svm = has_coarse_grain_buffer_svm(queue.device) if not has_cg_svm: pytest.skip("Need coarse-grained SVM support for this test.") alloc = cl_tools.SVMAllocator(context, queue=queue) if use_mempool: alloc = cl_tools.SVMPool(alloc) def alloc2(size): allocation = alloc(size) allocation.bind_to_queue(queue2) return allocation a_dev = cl_array.arange(queue, 2000, dtype=np.float32, allocator=alloc) b_dev = cl_array.to_device(queue, np.arange(2000), allocator=alloc) + 4000 assert a_dev.allocator is alloc assert b_dev.allocator is alloc assert a_dev.data._queue == queue assert b_dev.data._queue == queue a_dev2 = cl_array.arange(queue2, 2000, dtype=np.float32, allocator=alloc2) b_dev2 = cl_array.to_device(queue2, np.arange(2000), allocator=alloc2) + 4000 assert a_dev2.allocator is alloc2 assert b_dev2.allocator is alloc2 assert a_dev2.data._queue == queue2 assert b_dev2.data._queue == queue2 np.testing.assert_allclose((a_dev+b_dev).get(), (a_dev2+b_dev2).get()) with pytest.warns(cl_array.InconsistentOpenCLQueueWarning): a_dev2.with_queue(queue) # safe to let this proceed to deallocation, since we're not # operating on the memory with pytest.warns(cl_array.InconsistentOpenCLQueueWarning): cl_array.empty(queue2, 2000, np.float32, allocator=alloc) # safe to let this proceed to deallocation, since we're not # operating on the memory # }}} def test_logical_and_or(ctx_factory): # NOTE: Copied over from pycuda/test/test_gpuarray.py rng = np.random.default_rng(seed=0) ctx = ctx_factory() cq = cl.CommandQueue(ctx) for op in ["logical_and", "logical_or"]: x_np = rng.random((10, 4)) y_np = rng.random((10, 4)) zeros_np = np.zeros((10, 4)) ones_np = np.ones((10, 4)) x_cl = cl_array.to_device(cq, x_np) y_cl = cl_array.to_device(cq, y_np) zeros_cl = cl_array.zeros(cq, (10, 4), np.float64) ones_cl = cl_array.zeros(cq, (10, 4), np.float64) + 1 np.testing.assert_array_equal( getattr(cl_array, op)(x_cl, y_cl).get(), getattr(np, op)(x_np, y_np)) np.testing.assert_array_equal( getattr(cl_array, op)(x_cl, ones_cl).get(), getattr(np, op)(x_np, ones_np)) np.testing.assert_array_equal( getattr(cl_array, op)(x_cl, zeros_cl).get(), getattr(np, op)(x_np, zeros_np)) np.testing.assert_array_equal( getattr(cl_array, op)(x_cl, 1.0).get(), getattr(np, op)(x_np, ones_np)) np.testing.assert_array_equal( getattr(cl_array, op)(x_cl, 0.0).get(), getattr(np, op)(x_np, 0.0)) 
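

# A minimal, hedged usage sketch (not part of the original suite) of the
# numpy-style logical ops exercised in test_logical_and_or above: any nonzero
# entry counts as true, mirroring np.logical_and/np.logical_or.
# create_some_context picks an arbitrary available device, so this is kept
# commented out as an illustration rather than run as a test:
#
#   import numpy as np
#   import pyopencl as cl
#   import pyopencl.array as cl_array
#
#   ctx = cl.create_some_context()
#   queue = cl.CommandQueue(ctx)
#   x = cl_array.to_device(queue, np.array([0.0, 1.5, 0.0, 2.0]))
#   y = cl_array.to_device(queue, np.array([1.0, 0.0, 0.0, 3.0]))
#   print(cl_array.logical_and(x, y).get())  # -> [0 0 0 1]
#   print(cl_array.logical_or(x, y).get())   # -> [1 1 0 1]
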
def test_logical_not(ctx_factory): # NOTE: Copied over from pycuda/test/test_gpuarray.py ctx = ctx_factory() cq = cl.CommandQueue(ctx) rng = np.random.default_rng(seed=0) x_np = rng.random((10, 4)) x_cl = cl_array.to_device(cq, x_np) np.testing.assert_array_equal( cl_array.logical_not(x_cl).get(), np.logical_not(x_np)) np.testing.assert_array_equal( cl_array.logical_not(cl_array.zeros(cq, 10, np.float64)).get(), np.logical_not(np.zeros(10))) np.testing.assert_array_equal( cl_array.logical_not(cl_array.zeros(cq, 10, np.float64) + 1).get(), np.logical_not(np.ones(10))) # {{{ test XDG_CACHE_HOME handling @pytest.mark.skipif(sys.platform == "win32", reason="XDG_CACHE_HOME is not used on Windows") def test_xdg_cache_home(ctx_factory): import os import shutil from os.path import join context = ctx_factory() queue = cl.CommandQueue(context) a = np.array([1, 2, 3, 4, 5]).astype(np.float32) a_gpu = cl_array.to_device(queue, a) xdg_dir = "tmpdir_pyopencl_xdg_test" # PyOpenCL uses pytools.PersistentDict for invoker caches, # which is why xdg_dir will always exist. Therefore, check # whether xdg_pyopencl_dir exists. xdg_pyopencl_dir = join(xdg_dir, "pyopencl") assert not os.path.exists(xdg_dir) old_xdg_cache_home = None old_characterize_has_src_build_cache = None try: # Make sure that the source build cache is enabled old_characterize_has_src_build_cache = \ cl.characterize.has_src_build_cache cl.characterize.has_src_build_cache = lambda dev: False old_xdg_cache_home = os.getenv("XDG_CACHE_HOME") os.environ["XDG_CACHE_HOME"] = xdg_dir result = pow(a_gpu, a_gpu).get() assert (np.abs(a ** a - result) < 3e-3).all() assert os.path.exists(xdg_pyopencl_dir) finally: cl.characterize.has_src_build_cache = \ old_characterize_has_src_build_cache if old_xdg_cache_home is not None: os.environ["XDG_CACHE_HOME"] = old_xdg_cache_home else: del os.environ["XDG_CACHE_HOME"] shutil.rmtree(xdg_dir) # }}} def test_numpy_type_promotion_with_cl_arrays(ctx_factory): ctx = ctx_factory() queue = cl.CommandQueue(ctx) class NotReallyAnArray: @property def dtype(self): return np.dtype("float64") # Make sure that np.result_type accesses only the dtype attribute of the # class, not (e.g.) its data. assert np.result_type(42, NotReallyAnArray()) == np.float64 from pyopencl.array import _get_common_dtype assert _get_common_dtype(42, NotReallyAnArray(), queue) == np.float64 assert _get_common_dtype(42.0, NotReallyAnArray(), queue) == np.float64 if __name__ == "__main__": if len(sys.argv) > 1: exec(sys.argv[1]) else: from pytest import main main([__file__]) # vim: fdm=marker pyopencl-2025.1/test/test_arrays_in_structs.py0000644000000000000000000000667614332717401016551 0ustar00__copyright__ = "Copyright (C) 2020 Sotiris Niarchos" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""

import numpy as np

import pyopencl as cl
import pyopencl.cltypes as cltypes
import pyopencl.tools as cl_tools
from pyopencl import mem_flags
from pyopencl.tools import (
    pytest_generate_tests_for_pyopencl as pytest_generate_tests,  # noqa: F401
)


def test_struct_with_array_fields(ctx_factory):
    # The dtype below describes this C struct (note that "y" is an int,
    # matching cltypes.int and the int arithmetic in the kernel):
    #
    # typedef struct {
    #     uint x[2];
    #     int y;
    #     uint z[3][4];
    # } my_struct;
    #
    cl_ctx = ctx_factory()
    device = cl_ctx.devices[0]
    queue = cl.CommandQueue(cl_ctx)

    my_struct = np.dtype([
        ("x", cltypes.uint, 2),
        ("y", cltypes.int),
        ("z", cltypes.uint, (3, 4))
    ])
    my_struct, cdecl = cl_tools.match_dtype_to_c_struct(
        device, "my_struct", my_struct
    )

    # a random buffer of 4 structs
    my_struct_arr = np.array([
        ([81, 24], -57, [[15, 28, 45, 7], [71, 95, 65, 84], [2, 11, 59, 9]]),
        ([5, 20], 47, [[15, 53, 7, 59], [73, 22, 27, 86], [59, 6, 39, 49]]),
        ([11, 99], -32, [[73, 83, 4, 65], [19, 21, 22, 27], [1, 55, 6, 64]]),
        ([57, 38], -54, [[74, 90, 38, 67], [77, 30, 99, 18], [91, 3, 63, 67]])
    ], dtype=my_struct)

    expected_res = []
    for x in my_struct_arr:
        expected_res.append(int(np.sum(x[0]) + x[1] + np.sum(x[2])))
    expected_res = np.array(expected_res, dtype=cltypes.int)

    kernel_src = """%s

    // this kernel sums every number contained in each struct
    __kernel void array_structs(__global my_struct *structs, __global int *res)
    {
        int i = get_global_id(0);
        my_struct s = structs[i];

        res[i] = s.x[0] + s.x[1] + s.y;

        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 4; c++)
                res[i] += s.z[r][c];
    }""" % cdecl

    mem_flags1 = mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR
    mem_flags2 = mem_flags.WRITE_ONLY

    my_struct_buf = cl.Buffer(cl_ctx, mem_flags1, hostbuf=my_struct_arr)
    res_buf = cl.Buffer(cl_ctx, mem_flags2, size=expected_res.nbytes)

    program = cl.Program(cl_ctx, kernel_src).build()
    kernel = program.array_structs
    kernel(queue, (4,), None, my_struct_buf, res_buf)

    res = np.empty_like(expected_res)
    cl.enqueue_copy(queue, res, res_buf)

    assert (res == expected_res).all()


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        exec(sys.argv[1])
    else:
        from pytest import main
        main([__file__])
pyopencl-2025.1/test/test_clmath.py0000644000000000000000000003747414332717401014232 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner"

__license__ = """
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
""" import math import numpy as np import pytest import pyopencl as cl import pyopencl.array as cl_array import pyopencl.clmath as clmath from pyopencl.characterize import has_double_support, has_struct_arg_count_bug from pyopencl.tools import ( pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) try: import faulthandler except ImportError: pass else: faulthandler.enable() sizes = [10, 128, 1 << 10, 1 << 11, 1 << 13] numpy_func_names = { "asin": "arcsin", "acos": "arccos", "atan": "arctan", } def make_unary_function_test(name, limits=(0, 1), threshold=0, use_complex=False): (a, b) = limits a = float(a) b = float(b) def test(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) gpu_func = getattr(clmath, name) cpu_func = getattr(np, numpy_func_names.get(name, name)) dev = context.devices[0] if has_double_support(dev): if use_complex and has_struct_arg_count_bug(dev) == "apple": dtypes = [np.float32, np.float64, np.complex64] elif use_complex: dtypes = [np.float32, np.float64, np.complex64, np.complex128] else: dtypes = [np.float32, np.float64] else: if use_complex: dtypes = [np.float32, np.complex64] else: dtypes = [np.float32] for s in sizes: for dtype in dtypes: dtype = np.dtype(dtype) args = cl_array.arange(queue, a, b, (b-a)/s, dtype=dtype) if dtype.kind == "c": # args = args + dtype.type(1j) * args args = args + args * dtype.type(1j) gpu_results = gpu_func(args).get() cpu_results = cpu_func(args.get()) my_threshold = threshold if dtype.kind == "c" and isinstance(use_complex, float): my_threshold = use_complex max_err = np.max(np.abs(cpu_results - gpu_results)) assert (max_err <= my_threshold).all(), \ (max_err, name, dtype) return test test_ceil = make_unary_function_test("ceil", (-10, 10)) test_floor = make_unary_function_test("ceil", (-10, 10)) test_fabs = make_unary_function_test("fabs", (-10, 10)) test_exp = make_unary_function_test("exp", (-3, 3), 1e-5, use_complex=True) test_log = make_unary_function_test("log", (1e-5, 1), 1e-6, use_complex=True) test_log10 = make_unary_function_test("log10", (1e-5, 1), 5e-7) test_sqrt = make_unary_function_test("sqrt", (1e-5, 1), 3e-7, use_complex=True) test_sin = make_unary_function_test("sin", (-10, 10), 2e-7, use_complex=2e-2) test_cos = make_unary_function_test("cos", (-10, 10), 2e-7, use_complex=2e-2) test_asin = make_unary_function_test("asin", (-0.9, 0.9), 5e-7) test_acos = make_unary_function_test("acos", (-0.9, 0.9), 5e-7) test_tan = make_unary_function_test("tan", (-math.pi/2 + 0.1, math.pi/2 - 0.1), 4e-5, use_complex=True) test_atan = make_unary_function_test("atan", (-10, 10), 2e-7) test_sinh = make_unary_function_test("sinh", (-3, 3), 3e-6, use_complex=2e-3) test_cosh = make_unary_function_test("cosh", (-3, 3), 3e-6, use_complex=2e-3) test_tanh = make_unary_function_test("tanh", (-3, 3), 2e-6, use_complex=True) def test_atan2(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = (cl_array.arange(queue, s, dtype=np.float32) - np.float32(s / 2)) / 100 a2 = (s / 2 - 1 - cl_array.arange(queue, s, dtype=np.float32)) / 100 b = clmath.atan2(a, a2) a = a.get() a2 = a2.get() b = b.get() for i in range(s): assert abs(math.atan2(a[i], a2[i]) - b[i]) < 1e-6 def test_atan2pi(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = (cl_array.arange(queue, s, dtype=np.float32) - np.float32(s / 2)) / 100 a2 = (s / 2 - 1 - cl_array.arange(queue, s, dtype=np.float32)) / 100 b = clmath.atan2pi(a, a2) a = a.get() a2 = a2.get() b = b.get() for 
i in range(s): assert abs(math.atan2(a[i], a2[i]) / math.pi - b[i]) < 1e-6 def test_fmod(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = cl_array.arange(queue, s, dtype=np.float32)/10 a2 = cl_array.arange(queue, s, dtype=np.float32)/45.2 + 0.1 b = clmath.fmod(a, a2) # https://salsa.debian.org/opencl-team/python-pyopencl/-/merge_requests/3#note_383761 assert np.max(np.abs(np.fmod(a.get(), a2.get()) - b.get())) < 1e-4 def test_ldexp(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = cl_array.arange(queue, s, dtype=np.float32) a2 = cl_array.arange(queue, s, dtype=np.float32)*1e-3 b = clmath.ldexp(a, a2) a = a.get() a2 = a2.get() b = b.get() for i in range(s): assert math.ldexp(a[i], int(a2[i])) == b[i] def test_modf(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = cl_array.arange(queue, s, dtype=np.float32)/10 fracpart, intpart = clmath.modf(a) a = a.get() intpart = intpart.get() fracpart = fracpart.get() for i in range(s): fracpart_true, intpart_true = math.modf(a[i]) assert intpart_true == intpart[i] assert abs(fracpart_true - fracpart[i]) < 1e-4 def test_frexp(ctx_factory): context = ctx_factory() queue = cl.CommandQueue(context) for s in sizes: a = cl_array.arange(queue, s, dtype=np.float32)/10 significands, exponents = clmath.frexp(a) a = a.get() significands = significands.get() exponents = exponents.get() for i in range(s): sig_true, ex_true = math.frexp(a[i]) assert sig_true == significands[i] assert ex_true == exponents[i] def test_bessel(ctx_factory): try: import scipy.special as spec except ImportError: from pytest import skip skip("scipy not present--cannot test Bessel function") ctx = ctx_factory() queue = cl.CommandQueue(ctx) if not has_double_support(ctx.devices[0]): from pytest import skip skip("no double precision support--cannot test bessel function") nterms = 30 try: from pyfmmlib import hank103_vec, jfuns2d except ImportError: use_pyfmmlib = False else: use_pyfmmlib = True print("PYFMMLIB", use_pyfmmlib) if use_pyfmmlib: a = np.logspace(-3, 3, 10**6) else: a = np.logspace(-5, 5, 10**6) for which_func, cl_func, scipy_func, is_rel in [ ("j", clmath.bessel_jn, spec.jn, False), ("y", clmath.bessel_yn, spec.yn, True) ]: if is_rel: def get_err(check, ref): return np.max(np.abs(check-ref)) / np.max(np.abs(ref)) else: def get_err(check, ref): return np.max(np.abs(check-ref)) if use_pyfmmlib: pfymm_result = np.empty((len(a), nterms), dtype=np.complex128) if which_func == "j": for i, a_i in enumerate(a): if i % 100000 == 0: print("%.1f %%" % (100 * i/len(a))) ier, fjs, _, _ = jfuns2d(nterms, a_i, 1, 0, 10000) pfymm_result[i] = fjs[:nterms] assert ier == 0 elif which_func == "y": h0, h1 = hank103_vec(a, ifexpon=1) pfymm_result[:, 0] = h0.imag pfymm_result[:, 1] = h1.imag a_dev = cl_array.to_device(queue, a) for n in range(0, nterms): cl_bessel = cl_func(n, a_dev).get() scipy_bessel = scipy_func(n, a) error_scipy = get_err(cl_bessel, scipy_bessel) assert error_scipy < 1e-10, error_scipy if use_pyfmmlib and ( which_func == "j" or (which_func == "y" and n in [0, 1])): pyfmm_bessel = pfymm_result[:, n] error_pyfmm = get_err(cl_bessel, pyfmm_bessel) assert error_pyfmm < 1e-10, error_pyfmm error_pyfmm_scipy = get_err(scipy_bessel, pyfmm_bessel) print(which_func, n, error_scipy, error_pyfmm, error_pyfmm_scipy) else: print(which_func, n, error_scipy) assert not np.isnan(cl_bessel).any() if 0 and n == 15: import matplotlib.pyplot as pt # pt.plot(scipy_bessel) # 
pt.plot(cl_bessel) pt.loglog(a, np.abs(cl_bessel-scipy_bessel), label="vs scipy") if use_pyfmmlib: pt.loglog(a, np.abs(cl_bessel-pyfmm_bessel), label="vs pyfmmlib") pt.legend() pt.show() @pytest.mark.parametrize("ref_src", ["pyfmmlib", "scipy"]) def test_complex_bessel(ctx_factory, ref_src): ctx = ctx_factory() queue = cl.CommandQueue(ctx) if not has_double_support(ctx.devices[0]): from pytest import skip skip("no double precision support--cannot test complex bessel function") v = 40 n = 10**5 rng = np.random.default_rng(seed=13) z = ( np.logspace(-5, 2, n) * np.exp(1j * 2 * np.pi * rng.random(n))) if ref_src == "pyfmmlib": pyfmmlib = pytest.importorskip("pyfmmlib") jv_ref = np.zeros(len(z), "complex") vin = v+1 for i in range(len(z)): ier, fjs, _, _ = pyfmmlib.jfuns2d(vin, z[i], 1, 0, 10000) assert ier == 0 jv_ref[i] = fjs[v] elif ref_src == "scipy": spec = pytest.importorskip("scipy.special") jv_ref = spec.jv(v, z) else: raise ValueError("ref_src") z_dev = cl_array.to_device(queue, z) jv_dev = clmath.bessel_jn(v, z_dev) abs_err_jv = np.abs(jv_dev.get() - jv_ref) abs_jv_ref = np.abs(jv_ref) rel_err_jv = abs_err_jv/abs_jv_ref # use absolute error instead if the function value itself is too small tiny = 1e-300 ind = abs_jv_ref < tiny rel_err_jv[ind] = abs_err_jv[ind] # if the reference value is inf or nan, set the error to zero ind1 = np.isinf(abs_jv_ref) ind2 = np.isnan(abs_jv_ref) rel_err_jv[ind1] = 0 rel_err_jv[ind2] = 0 if 0: print(abs(z)) print(np.abs(jv_ref)) print(np.abs(jv_dev.get())) print(rel_err_jv) max_err = np.max(rel_err_jv) assert max_err <= 2e-13, max_err print("Jv", np.max(rel_err_jv)) if 0: import matplotlib.pyplot as pt pt.loglog(np.abs(z), rel_err_jv) pt.show() @pytest.mark.parametrize("ref_src", ["pyfmmlib", "scipy"]) def test_hankel_01_complex(ctx_factory, ref_src): ctx = ctx_factory() queue = cl.CommandQueue(ctx) if not has_double_support(ctx.devices[0]): from pytest import skip skip("no double precision support--cannot test complex bessel function") rng = np.random.default_rng(seed=11) n = 10**6 z = ( np.logspace(-5, 2, n) * np.exp(1j * 2 * np.pi * rng.random(n))) def get_err(check, ref): return np.max(np.abs(check-ref)) / np.max(np.abs(ref)) if ref_src == "pyfmmlib": pyfmmlib = pytest.importorskip("pyfmmlib") h0_ref, h1_ref = pyfmmlib.hank103_vec(z, ifexpon=1) elif ref_src == "scipy": spec = pytest.importorskip("scipy.special") h0_ref = spec.hankel1(0, z) h1_ref = spec.hankel1(1, z) else: raise ValueError("ref_src") z_dev = cl_array.to_device(queue, z) h0_dev, h1_dev = clmath.hankel_01(z_dev) rel_err_h0 = np.abs(h0_dev.get() - h0_ref)/np.abs(h0_ref) rel_err_h1 = np.abs(h1_dev.get() - h1_ref)/np.abs(h1_ref) max_rel_err_h0 = np.max(rel_err_h0) max_rel_err_h1 = np.max(rel_err_h1) print("H0", max_rel_err_h0) print("H1", max_rel_err_h1) assert max_rel_err_h0 < 4e-13 assert max_rel_err_h1 < 2e-13 if 0: import matplotlib.pyplot as pt pt.loglog(np.abs(z), rel_err_h0) pt.loglog(np.abs(z), rel_err_h1) pt.show() @pytest.mark.parametrize("dtype", [np.complex64, np.complex128]) def test_complex_muladd(ctx_factory, dtype): ctx = ctx_factory() queue = cl.CommandQueue(ctx) if dtype == np.complex128 and not has_double_support(ctx.devices[0]): from pytest import skip skip("no double precision support") if dtype == np.complex64: real_type = np.float32 real_type_name = "float" else: real_type = np.float64 real_type_name = "double" rng = np.random.default_rng(seed=11) n = 100 arrs = [rng.random(n, dtype=real_type) + 1j*rng.random(n, dtype=real_type) for i in range(3)] arrs = 
[arr.astype(dtype) for arr in arrs]
    arrs_dev = [cl_array.to_device(queue, arr) for arr in arrs]

    prg_str = """
        #if __OPENCL_C_VERSION__ < 120
        #pragma OPENCL EXTENSION cl_khr_fp64: enable
        #endif
        #define PYOPENCL_DEFINE_CDOUBLE
        #include <pyopencl-complex.h>

        __kernel void foo(
            __global const c{real_type_name}_t *a,
            __global const c{real_type_name}_t *b,
            __global const c{real_type_name}_t *c,
            __global c{real_type_name}_t *res
            )
        {{
            int gid = get_global_id(0);
            res[gid] = c{real_type_name}_fma(a[gid], b[gid], c[gid]);
        }}
        """.format(real_type_name=real_type_name)

    prg = cl.Program(ctx, prg_str).build()
    knl = prg.foo

    result_dev = cl_array.empty_like(arrs_dev[0])

    knl(queue, (n,), None,
            arrs_dev[0].data, arrs_dev[1].data, arrs_dev[2].data,
            result_dev.data)
    ref = arrs[0] * arrs[1] + arrs[2]

    rel_err = np.abs(result_dev.get() - ref)/np.abs(ref)
    if dtype == np.complex64:
        assert np.max(rel_err) < 1e-6
    else:
        assert np.max(rel_err) < 1e-12


def test_outoforderqueue_clmath(ctx_factory):
    context = ctx_factory()
    try:
        queue = cl.CommandQueue(context,
               properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    except Exception:
        pytest.skip("out-of-order queue not available")
    rng = np.random.default_rng(seed=42)
    a = rng.random(10**6, dtype=np.float32)
    a_gpu = cl_array.to_device(queue, a)
    # testing that clmath functions wait for and create events
    b_gpu = clmath.fabs(clmath.sin(a_gpu * 5))
    queue.finish()
    b1 = b_gpu.get()
    b = np.abs(np.sin(a * 5))
    assert np.abs(b1 - b).mean() < 1e-5


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        exec(sys.argv[1])
    else:
        from pytest import main
        main([__file__])
pyopencl-2025.1/test/test_clrandom.py0000644000000000000000000000463714332717401014563 0ustar00__copyright__ = "Copyright (C) 2018 Matt Wala"

__license__ = """
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
""" import numpy as np import pytest import pyopencl as cl import pyopencl.clrandom as clrandom import pyopencl.cltypes as cltypes from pyopencl.characterize import has_double_support from pyopencl.tools import ( pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) try: import faulthandler except ImportError: pass else: faulthandler.enable() @pytest.mark.parametrize("rng_class", [ clrandom.PhiloxGenerator, clrandom.ThreefryGenerator]) @pytest.mark.parametrize("dtype", [ np.int32, np.int64, np.float32, np.float64, cltypes.float2, # type: ignore[attr-defined] cltypes.float3, # type: ignore[attr-defined] cltypes.float4, # type: ignore[attr-defined] ]) def test_clrandom_dtypes(ctx_factory, rng_class, dtype): cl_ctx = ctx_factory() if dtype == np.float64 and not has_double_support(cl_ctx.devices[0]): pytest.skip("double precision not supported on this device") rng = rng_class(cl_ctx) size = 10 with cl.CommandQueue(cl_ctx) as queue: rng.uniform(queue, size, dtype) if dtype not in (np.int32, np.int64): rng.normal(queue, size, dtype) if __name__ == "__main__": import sys if len(sys.argv) > 1: exec(sys.argv[1]) else: from pytest import main main([__file__]) pyopencl-2025.1/test/test_enqueue_copy.py0000644000000000000000000002306114332717401015457 0ustar00#! /usr/bin/env python __copyright__ = "Copyright (C) 2016 Shane J. Latham" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ import numpy as np import pytest import pyopencl as cl from pyopencl.characterize import get_pocl_version from pyopencl.tools import ( pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) def generate_slice(start, shape): return tuple(slice(start[i], start[i]+shape[i]) for i in range(len(start))) def test_enqueue_copy_rect_2d(ctx_factory, honor_skip=True): """ Test 2D sub-array (slice) copy. 
""" ctx = ctx_factory() queue = cl.CommandQueue(ctx) if (honor_skip and ctx.devices[0].platform.name == "Portable Computing Language" and get_pocl_version(ctx.devices[0].platform) <= (0, 13)): # https://github.com/pocl/pocl/issues/353 pytest.skip("PoCL's rectangular copies crash") device = queue.device if device.platform.vendor == "The pocl project" \ and device.type & cl.device_type.GPU: pytest.xfail("rect copies fail on PoCL + Nvidia," "at least the K40, as of PoCL 1.6, 2021-01-20") if honor_skip and queue.device.platform.name == "Apple": pytest.xfail("Apple's CL implementation crashes on this.") ary_in_shp = 256, 128 # Entire array shape from which sub-array copied to device sub_ary_shp = 128, 96 # Sub-array shape to be copied to device ary_in_origin = 20, 13 # Sub-array origin ary_in_slice = generate_slice(ary_in_origin, sub_ary_shp) ary_out_origin = 11, 19 # Origin of sub-array copy from device to host-array ary_out_shp = 512, 256 # Entire host-array shape copy sub-array device->host ary_out_slice = generate_slice(ary_out_origin, sub_ary_shp) buf_in_origin = 7, 3 # Origin of sub-array in device buffer buf_in_shp = 300, 200 # shape of device buffer buf_out_origin = 31, 17 # Origin of 2nd device buffer buf_out_shp = 300, 400 # shape of 2nd device buffer # Create host array of random values. rng = np.random.default_rng(seed=42) h_ary_in = rng.integers(0, 256, ary_in_shp, dtype=np.uint8) # Create device buffers d_in_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=np.prod(buf_in_shp)) d_out_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=np.prod(buf_out_shp)) # Copy sub-array (rectangular buffer) from host to device cl.enqueue_copy( queue, d_in_buf, h_ary_in, buffer_origin=buf_in_origin[::-1], host_origin=ary_in_origin[::-1], region=sub_ary_shp[::-1], buffer_pitches=(buf_in_shp[-1],), host_pitches=(ary_in_shp[-1],) ) # Copy sub-array (rectangular buffer) from device-buffer to device-buffer cl.enqueue_copy( queue, d_out_buf, d_in_buf, src_origin=buf_in_origin[::-1], dst_origin=buf_out_origin[::-1], region=sub_ary_shp[::-1], src_pitches=(buf_in_shp[-1],), dst_pitches=(buf_out_shp[-1],) ) # Create zero-initialised array to receive sub-array from device h_ary_out = np.zeros(ary_out_shp, dtype=h_ary_in.dtype) # Copy sub-array (rectangular buffer) from device to host-array. cl.enqueue_copy( queue, h_ary_out, d_out_buf, buffer_origin=buf_out_origin[::-1], host_origin=ary_out_origin[::-1], region=sub_ary_shp[::-1], buffer_pitches=(buf_out_shp[-1],), host_pitches=(ary_out_shp[-1],) ) queue.finish() # Check that the sub-array copied to device is # the same as the sub-array received from device. assert np.all(h_ary_in[ary_in_slice] == h_ary_out[ary_out_slice]) def test_enqueue_copy_rect_3d(ctx_factory, honor_skip=True): """ Test 3D sub-array (slice) copy. 
""" ctx = ctx_factory() queue = cl.CommandQueue(ctx) if (honor_skip and ctx.devices[0].platform.name == "Portable Computing Language" and get_pocl_version(ctx.devices[0].platform) <= (0, 13)): # https://github.com/pocl/pocl/issues/353 pytest.skip("PoCL's rectangular copies crash") device = queue.device if device.platform.vendor == "The pocl project" \ and device.type & cl.device_type.GPU: pytest.xfail("rect copies fail on PoCL + Nvidia," "at least the K40, as of PoCL 1.6, 2021-01-20") if honor_skip and queue.device.platform.name == "Apple": pytest.skip("Apple's CL implementation crashes on this.") ary_in_shp = 256, 128, 31 # array shape from which sub-array copied to device sub_ary_shp = 128, 96, 20 # Sub-array shape to be copied to device ary_in_origin = 20, 13, 7 # Sub-array origin ary_in_slice = generate_slice(ary_in_origin, sub_ary_shp) ary_out_origin = 11, 19, 14 # Origin of sub-array copy from device to host-array ary_out_shp = 192, 256, 128 # Entire host-array shape copy sub-array dev->host ary_out_slice = generate_slice(ary_out_origin, sub_ary_shp) buf_in_origin = 7, 3, 6 # Origin of sub-array in device buffer buf_in_shp = 300, 200, 30 # shape of device buffer buf_out_origin = 31, 17, 3 # Origin of 2nd device buffer buf_out_shp = 300, 400, 40 # shape of 2nd device buffer # Create host array of random values. rng = np.random.default_rng(seed=42) h_ary_in = rng.integers(0, 256, ary_in_shp, dtype=np.uint8) # Create device buffers d_in_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=np.prod(buf_in_shp)) d_out_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=np.prod(buf_out_shp)) # Copy sub-array (rectangular buffer) from host to device cl.enqueue_copy( queue, d_in_buf, h_ary_in, buffer_origin=buf_in_origin[::-1], host_origin=ary_in_origin[::-1], region=sub_ary_shp[::-1], buffer_pitches=(buf_in_shp[-1], buf_in_shp[-1]*buf_in_shp[-2]), host_pitches=(ary_in_shp[-1], ary_in_shp[-1]*ary_in_shp[-2]) ) # Copy sub-array (rectangular buffer) from device-buffer to device-buffer cl.enqueue_copy( queue, d_out_buf, d_in_buf, src_origin=buf_in_origin[::-1], dst_origin=buf_out_origin[::-1], region=sub_ary_shp[::-1], src_pitches=(buf_in_shp[-1], buf_in_shp[-1]*buf_in_shp[-2]), dst_pitches=(buf_out_shp[-1], buf_out_shp[-1]*buf_out_shp[-2]) ) # Create zero-initialised array to receive sub-array from device h_ary_out = np.zeros(ary_out_shp, dtype=h_ary_in.dtype) # Copy sub-array (rectangular buffer) from device to host-array. cl.enqueue_copy( queue, h_ary_out, d_out_buf, buffer_origin=buf_out_origin[::-1], host_origin=ary_out_origin[::-1], region=sub_ary_shp[::-1], buffer_pitches=(buf_out_shp[-1], buf_out_shp[-1]*buf_out_shp[-2]), host_pitches=(ary_out_shp[-1], ary_out_shp[-1]*ary_out_shp[-2]) ) queue.finish() # Check that the sub-array copied to device is # the same as the sub-array received from device. assert np.array_equal(h_ary_in[ary_in_slice], h_ary_out[ary_out_slice]) def test_enqueue_copy_buffer_p2p_amd(honor_skip=True): platform = cl.get_platforms()[0] if honor_skip and platform.vendor != "Advanced Micro Devices, Inc.": pytest.skip("AMD-specific test") devices = platform.get_devices() if len(devices) < 2: pytest.skip("Need at least two devices") ctx1 = cl.Context([devices[0]]) ctx2 = cl.Context([devices[1]]) queue1 = cl.CommandQueue(ctx1) queue2 = cl.CommandQueue(ctx2) ary_shp = 256, 128, 32 # array shape # Create host array of random values. 
rng = np.random.default_rng(seed=42) h_ary = rng.integers(0, 256, ary_shp, dtype=np.uint8) # Create device buffers d_buf1 = cl.Buffer(ctx1, cl.mem_flags.READ_WRITE, size=np.prod(ary_shp)) d_buf2 = cl.Buffer(ctx2, cl.mem_flags.READ_WRITE, size=np.prod(ary_shp)) # Copy array from host to device cl.enqueue_copy(queue1, d_buf1, h_ary) # Copy array from device to device cl.enqueue_copy_buffer_p2p_amd( platform, queue1, d_buf1, d_buf2, np.prod(ary_shp) ) queue1.finish() # Create zero-initialised array to receive array from device h_ary_out = np.zeros(ary_shp, dtype=h_ary.dtype) # Copy array from device to host cl.enqueue_copy(queue2, h_ary_out, d_buf2) queue2.finish() # Check that the arrays are the same assert np.array_equal(h_ary, h_ary_out) if __name__ == "__main__": import sys if len(sys.argv) > 1: exec(sys.argv[1]) else: from pytest import main main([__file__]) pyopencl-2025.1/test/test_wrapper.py0000644000000000000000000012766314332717401014453 0ustar00__copyright__ = "Copyright (C) 2009 Andreas Kloeckner" __license__ = """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" import numpy as np import numpy.linalg as la import pytest import pyopencl as cl import pyopencl.array as cl_array import pyopencl.clrandom import pyopencl.cltypes as cltypes from pyopencl.characterize import get_pocl_version from pyopencl.tools import ( DeferredAllocator, ImmediateAllocator, pytest_generate_tests_for_pyopencl as pytest_generate_tests, # noqa: F401 ) def _xfail_if_pocl(plat, up_to_version, msg="unsupported by PoCL"): if plat.vendor == "The pocl project": if up_to_version is None or get_pocl_version(plat) <= up_to_version: pytest.xfail(msg) def _xfail_if_pocl_gpu(device, what): if device.platform.vendor == "The pocl project" \ and device.type & cl.device_type.GPU: pytest.xfail(f"PoCL's {what} support don't work right on Nvidia GPUs, " "at least the Titan V, as of PoCL 1.6, 2021-01-20") # {{{ test_get_info def test_get_info(ctx_factory): ctx = ctx_factory() device, = ctx.devices platform = device.platform with pytest.deprecated_call(): device.persistent_unique_id # noqa: B018 device.hashable_model_and_version_identifier # noqa: B018 failure_count = [0] pocl_quirks = [ (cl.Buffer, cl.mem_info.OFFSET), (cl.Program, cl.program_info.BINARIES), (cl.Program, cl.program_info.BINARY_SIZES), ] if ctx._get_cl_version() >= (1, 2) and cl.get_cl_header_version() >= (1, 2): pocl_quirks.extend([ (cl.Program, cl.program_info.KERNEL_NAMES), (cl.Program, cl.program_info.NUM_KERNELS), ]) CRASH_QUIRKS = [ # noqa: N806 (("NVIDIA Corporation", "NVIDIA CUDA", "OpenCL 1.0 CUDA 3.0.1"), [ (cl.Event, cl.event_info.COMMAND_QUEUE), ]), (("NVIDIA Corporation", "NVIDIA CUDA", "OpenCL 1.2 CUDA 7.5"), [ (cl.Buffer, getattr(cl.mem_info, "USES_SVM_POINTER", None)), ]), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.8-pre"), pocl_quirks), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.8"), pocl_quirks), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.9-pre"), pocl_quirks), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.9"), pocl_quirks), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.10-pre"), pocl_quirks), (("The pocl project", "Portable Computing Language", "OpenCL 1.2 pocl 0.10"), pocl_quirks), (("Apple", "Apple", "OpenCL 1.2"), [ (cl.Program, cl.program_info.SOURCE), ]), ] QUIRKS = [] # noqa: N806 def find_quirk(quirk_list, cl_obj, info): for (vendor, name, version), quirks in quirk_list: if ( vendor == platform.vendor and name == platform.name and platform.version.startswith(version)): for quirk_cls, quirk_info in quirks: if (isinstance(cl_obj, quirk_cls) and quirk_info == info): return True return False def do_test(cl_obj, info_cls, func=None, try_attr_form=True): if func is None: func = cl_obj.get_info for info_name in dir(info_cls): if not info_name.startswith("_") and info_name != "to_string": print(info_cls, info_name) info = getattr(info_cls, info_name) if find_quirk(CRASH_QUIRKS, cl_obj, info): print("not executing get_info", type(cl_obj), info_name) print("(known crash quirk for %s)" % platform.name) continue try: func(info) except Exception: msg = "failed get_info", type(cl_obj), info_name if find_quirk(QUIRKS, cl_obj, info): msg += ("(known quirk for %s)" % platform.name) else: failure_count[0] += 1 if try_attr_form: try: getattr(cl_obj, info_name.lower()) except Exception: print("failed attr-based get_info", type(cl_obj), info_name) if find_quirk(QUIRKS, cl_obj, info): print("(known quirk for %s)" % platform.name) else: failure_count[0] += 1 do_test(platform, 
cl.platform_info)
    do_test(device, cl.device_info)
    do_test(ctx, cl.context_info)

    props = 0
    if (device.queue_properties
            & cl.command_queue_properties.PROFILING_ENABLE):
        profiling = True
        props = cl.command_queue_properties.PROFILING_ENABLE
    else:
        profiling = False

    queue = cl.CommandQueue(ctx, properties=props)
    do_test(queue, cl.command_queue_info)

    prg = cl.Program(ctx, """
        __kernel void sum(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()
    do_test(prg, cl.program_info)
    do_test(prg, cl.program_build_info,
            lambda info: prg.get_build_info(device, info),
            try_attr_form=False)

    n = 2000
    a_buf = cl.Buffer(ctx, 0, n*4)
    do_test(a_buf, cl.mem_info)

    kernel = prg.all_kernels()[0]
    do_test(kernel, cl.kernel_info)

    for _i in range(2):  # exercise cache
        for info_name in dir(cl.kernel_work_group_info):
            if not info_name.startswith("_") and info_name != "to_string":
                try:
                    print("kernel_wg_info: %s" % info_name)
                    kernel.get_work_group_info(
                            getattr(cl.kernel_work_group_info, info_name),
                            device)
                except cl.LogicError as err:
                    print("<error: %s>" % err)

    evt = kernel(queue, (n,), None, a_buf)
    do_test(evt, cl.event_info)

    if profiling:
        evt.wait()
        do_test(evt, cl.profiling_info,
                lambda info: evt.get_profiling_info(info),
                try_attr_form=False)

    # crashes on intel...
    # and pocl does not support CL_ADDRESS_CLAMP
    if device.image_support and platform.vendor not in [
            "Intel(R) Corporation",
            "The pocl project",
            ]:
        smp = cl.Sampler(ctx, False,
                cl.addressing_mode.CLAMP,
                cl.filter_mode.NEAREST)
        do_test(smp, cl.sampler_info)

        img_format = cl.get_supported_image_formats(
                ctx, cl.mem_flags.READ_ONLY, cl.mem_object_type.IMAGE2D)[0]

        img = cl.Image(ctx, cl.mem_flags.READ_ONLY, img_format, (128, 256))
        assert img.shape == (128, 256)

        img.depth  # noqa: B018
        img.image.depth  # noqa: B018
        do_test(img, cl.image_info,
                lambda info: img.get_image_info(info))

# }}}


# {{{ test_int_ptr

def test_int_ptr(ctx_factory):
    def do_test(obj):
        new_obj = type(obj).from_int_ptr(obj.int_ptr)
        assert obj == new_obj
        assert type(obj) is type(new_obj)

    ctx = ctx_factory()
    device, = ctx.devices
    platform = device.platform
    do_test(device)
    do_test(platform)
    do_test(ctx)

    queue = cl.CommandQueue(ctx)
    do_test(queue)

    evt = cl.enqueue_marker(queue)
    do_test(evt)

    prg = cl.Program(ctx, """
        __kernel void sum(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()
    do_test(prg)
    do_test(prg.sum)

    n = 2000
    a_buf = cl.Buffer(ctx, 0, n*4)
    do_test(a_buf)

    # crashes on intel...
    # and pocl does not support CL_ADDRESS_CLAMP
    if device.image_support and platform.vendor not in [
            "Intel(R) Corporation",
            "The pocl project",
            ]:
        smp = cl.Sampler(ctx, False,
                cl.addressing_mode.CLAMP,
                cl.filter_mode.NEAREST)
        do_test(smp)

        img_format = cl.get_supported_image_formats(
                ctx, cl.mem_flags.READ_ONLY, cl.mem_object_type.IMAGE2D)[0]

        img = cl.Image(ctx, cl.mem_flags.READ_ONLY, img_format, (128, 256))
        do_test(img)

# }}}


# {{{ test_invalid_kernel_names_cause_failures

def test_invalid_kernel_names_cause_failures(ctx_factory):
    ctx = ctx_factory()
    device = ctx.devices[0]
    prg = cl.Program(ctx, """
        __kernel void sum(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()

    try:
        prg.sam  # noqa: B018
        raise RuntimeError("invalid kernel name did not cause error")
    except AttributeError:
        pass
    except RuntimeError:
        if "Intel" in device.platform.vendor:
            from pytest import xfail
            xfail("weird exception from OpenCL implementation "
                    "on invalid kernel name--are you using "
                    "Intel's implementation? 

# {{{ test_invalid_kernel_names_cause_failures

def test_invalid_kernel_names_cause_failures(ctx_factory):
    ctx = ctx_factory()
    device = ctx.devices[0]
    prg = cl.Program(ctx, """
        __kernel void sum(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()

    try:
        prg.sam  # noqa: B018
        raise RuntimeError("invalid kernel name did not cause error")
    except AttributeError:
        pass
    except RuntimeError:
        if "Intel" in device.platform.vendor:
            from pytest import xfail
            xfail("weird exception from OpenCL implementation "
                    "on invalid kernel name--are you using "
                    "Intel's implementation? (if so, known bug in Intel CL)")
        else:
            raise

# }}}


# {{{ test_image_format_constructor

def test_image_format_constructor():
    # doesn't need image support to succeed
    iform = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.FLOAT)

    assert iform.channel_order == cl.channel_order.RGBA
    assert iform.channel_data_type == cl.channel_type.FLOAT
    if not cl._PYPY:
        assert not hasattr(iform, "__dict__")

# }}}


# {{{ test_device_topology_amd_constructor

def test_device_topology_amd_constructor():
    # doesn't need cl_amd_device_attribute_query support to succeed
    topol = cl.DeviceTopologyAmd(3, 4, 5)

    assert topol.bus == 3
    assert topol.device == 4
    assert topol.function == 5

    if not cl._PYPY:
        assert not hasattr(topol, "__dict__")

# }}}


# {{{ test_nonempty_supported_image_formats

def test_nonempty_supported_image_formats(ctx_factory):
    context = ctx_factory()

    device = context.devices[0]

    if device.image_support:
        assert len(cl.get_supported_image_formats(
                context, cl.mem_flags.READ_ONLY, cl.mem_object_type.IMAGE2D)) > 0
    else:
        from pytest import skip
        skip("images not supported on %s" % device.name)

# }}}


# {{{ test_that_python_args_fail

def test_that_python_args_fail(ctx_factory):
    context = ctx_factory()

    prg = cl.Program(context, """
        __kernel void mult(__global float *a, float b, int c)
        { a[get_global_id(0)] *= (b+c); }
        """).build()

    rng = np.random.default_rng(seed=42)
    a = rng.random(50000)
    queue = cl.CommandQueue(context)
    mf = cl.mem_flags
    a_buf = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    knl = cl.Kernel(prg, "mult")
    try:
        knl(queue, a.shape, None, a_buf, 2, 3)
        raise AssertionError(
            "PyOpenCL should not accept bare Python types as arguments")
    except cl.LogicError:
        pass

    try:
        prg.mult(queue, a.shape, None, a_buf, float(2), 3)
        raise AssertionError(
            "PyOpenCL should not accept bare Python types as arguments")
    except cl.LogicError:
        pass

    prg.mult(queue, a.shape, None, a_buf, np.float32(2), np.int32(3))

    a_result = np.empty_like(a)
    cl.enqueue_copy(queue, a_result, a_buf).wait()

# }}}
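
# {{{ example: set_scalar_arg_dtypes (editor's addition)

# A hedged sketch of the usual remedy for the failure mode exercised in
# test_that_python_args_fail above: Kernel.set_scalar_arg_dtypes tells
# PyOpenCL how to convert bare Python scalars, so call sites no longer need
# to wrap every argument in a numpy scalar type. The kernel is assumed to be
# the "mult" kernel from that test.

def _example_scalar_arg_dtypes(queue, prg, a_buf, shape):
    knl = prg.mult
    # None marks buffer arguments; the dtypes apply to the scalar ones.
    knl.set_scalar_arg_dtypes([None, np.float32, np.int32])
    knl(queue, shape, None, a_buf, 2, 3)  # bare 2 and 3 are now acceptable

# }}}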

# {{{ test_image_2d

def test_image_2d(ctx_factory):
    context = ctx_factory()

    device, = context.devices

    if not device.image_support:
        from pytest import skip
        skip("images not supported on %s" % device)

    if "Intel" in device.vendor and "31360.31426" in device.version:
        from pytest import skip
        skip("images crashy on %s" % device)
    _xfail_if_pocl(device.platform, None, "PoCL does not support CL_ADDRESS_CLAMP")

    prg = cl.Program(context, """
        __kernel void copy_image(
          __global float *dest,
          __read_only image2d_t src,
          sampler_t samp,
          int stride0)
        {
          int d0 = get_global_id(0);
          int d1 = get_global_id(1);
          /*
          const sampler_t samp =
            CLK_NORMALIZED_COORDS_FALSE
            | CLK_ADDRESS_CLAMP
            | CLK_FILTER_NEAREST;
            */
          dest[d0*stride0 + d1] = read_imagef(src, samp, (float2)(d1, d0)).x;
        }
        """).build()

    num_channels = 1

    rng = np.random.default_rng(seed=42)
    a = rng.random((1024, 512, num_channels), dtype=np.float32)
    if num_channels == 1:
        a = a[:, :, 0]

    queue = cl.CommandQueue(context)
    try:
        a_img = cl.image_from_array(context, a, num_channels)
    except cl.RuntimeError as exc:
        if exc.code == cl.status_code.IMAGE_FORMAT_NOT_SUPPORTED:
            from pytest import skip
            skip("required image format not supported on %s" % device.name)
        else:
            raise

    a_dest = cl.Buffer(context, cl.mem_flags.READ_WRITE, a.nbytes)

    samp = cl.Sampler(context, False,
            cl.addressing_mode.CLAMP,
            cl.filter_mode.NEAREST)
    prg.copy_image(queue, a.shape, None, a_dest, a_img, samp,
            np.int32(a.strides[0]/a.dtype.itemsize))

    a_result = np.empty_like(a)
    cl.enqueue_copy(queue, a_result, a_dest)

    good = la.norm(a_result - a) == 0
    if not good:
        if queue.device.type & cl.device_type.CPU:
            assert good, ("The image implementation on your CPU CL platform '%s' "
                    "returned bad values. This is bad, but common."
                    % queue.device.platform)
        else:
            assert good

# }}}


# {{{ test_image_3d

def test_image_3d(ctx_factory):
    # test for image_from_array for 3d image of float2
    context = ctx_factory()

    device, = context.devices

    if not device.image_support:
        from pytest import skip
        skip("images not supported on %s" % device)

    if device.platform.vendor == "Intel(R) Corporation":
        from pytest import skip
        skip("images crashy on %s" % device)
    _xfail_if_pocl(device.platform, None, "PoCL does not support CL_ADDRESS_CLAMP")

    prg = cl.Program(context, """
        __kernel void copy_image_plane(
          __global float2 *dest,
          __read_only image3d_t src,
          sampler_t samp,
          int stride0,
          int stride1)
        {
          int d0 = get_global_id(0);
          int d1 = get_global_id(1);
          int d2 = get_global_id(2);
          /*
          const sampler_t samp =
            CLK_NORMALIZED_COORDS_FALSE
            | CLK_ADDRESS_CLAMP
            | CLK_FILTER_NEAREST;
            */
          dest[d0*stride0 + d1*stride1 + d2] = read_imagef(
                src, samp, (float4)(d2, d1, d0, 0)).xy;
        }
        """).build()

    num_channels = 2
    shape = (3, 4, 2)

    rng = np.random.default_rng(seed=42)
    a = rng.random(size=(*shape, num_channels), dtype=np.float32)

    queue = cl.CommandQueue(context)
    try:
        a_img = cl.image_from_array(context, a, num_channels)
    except cl.RuntimeError as exc:
        if exc.code == cl.status_code.IMAGE_FORMAT_NOT_SUPPORTED:
            from pytest import skip
            skip("required image format not supported on %s" % device.name)
        else:
            raise

    a_dest = cl.Buffer(context, cl.mem_flags.READ_WRITE, a.nbytes)

    samp = cl.Sampler(context, False,
            cl.addressing_mode.CLAMP,
            cl.filter_mode.NEAREST)
    prg.copy_image_plane(queue, shape, None, a_dest, a_img, samp,
            np.int32(a.strides[0]/a.itemsize/num_channels),
            np.int32(a.strides[1]/a.itemsize/num_channels),
            )

    a_result = np.empty_like(a)
    cl.enqueue_copy(queue, a_result, a_dest)

    good = la.norm(a_result - a) == 0
    if not good:
        if queue.device.type & cl.device_type.CPU:
            assert good, ("The image implementation on your CPU CL platform '%s' "
                    "returned bad values. This is bad, but common."
                    % queue.device.platform)
        else:
            assert good

# }}}
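
# {{{ example: partial image read-back (editor's addition)

# An illustrative sketch, not from the original file: enqueue_copy on images
# accepts origin/region, so a sub-rectangle can be read back without writing
# a copy kernel as the image tests above do. Image and tile sizes are made
# up; an RGBA float image format is assumed to be supported.

def _example_partial_image_read(ctx, queue):
    fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.FLOAT)
    img = cl.Image(ctx, cl.mem_flags.READ_ONLY, fmt, shape=(64, 64))
    tile = np.empty((16, 16, 4), np.float32)
    # origin/region are in (x, y) order, unlike numpy's (row, column)
    cl.enqueue_copy(queue, tile, img, origin=(0, 0), region=(16, 16))
    return tile

# }}}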

# {{{ test_copy_buffer

def test_copy_buffer(ctx_factory):
    context = ctx_factory()

    queue = cl.CommandQueue(context)

    mf = cl.mem_flags

    rng = np.random.default_rng(seed=42)
    a = rng.random(50000, dtype=np.float32)
    b = np.empty_like(a)

    buf1 = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    buf2 = cl.Buffer(context, mf.WRITE_ONLY, b.nbytes)

    cl.enqueue_copy(queue, buf2, buf1).wait()
    cl.enqueue_copy(queue, b, buf2).wait()

    assert la.norm(a - b) == 0

# }}}


# {{{ test_mempool_*

def test_mempool(ctx_factory):
    from pyopencl.tools import ImmediateAllocator, MemoryPool

    context = ctx_factory()
    queue = cl.CommandQueue(context)

    pool = MemoryPool(ImmediateAllocator(queue))
    alloc_queue = []

    e0 = 12

    for e in range(e0-6, e0-4):
        for _i in range(100):
            alloc_queue.append(pool.allocate(1 << e))
            if len(alloc_queue) > 10:
                alloc_queue.pop(0)
    del alloc_queue
    pool.stop_holding()


def test_mempool_2(ctx_factory):
    from random import randrange

    from pyopencl.tools import ImmediateAllocator, MemoryPool

    context = ctx_factory()
    queue = cl.CommandQueue(context)

    pool = MemoryPool(ImmediateAllocator(queue))

    for s in [randrange(1 << 31) >> randrange(32) for _ in range(2000)] + [2**30]:
        bin_nr = pool.bin_number(s)
        asize = pool.alloc_size(bin_nr)
        assert asize >= s, s
        assert pool.bin_number(asize) == bin_nr, s
        assert asize < asize*(1+1/8)


def test_mempool_32bit_issues():
    import struct
    if struct.calcsize("@P") * 8 < 64:
        pytest.skip("only relevant on 64-bit systems")

    # https://github.com/inducer/pycuda/issues/282
    from pyopencl._cl import _TestMemoryPool
    pool = _TestMemoryPool()

    for i in [30, 31, 32, 33, 34]:
        for offs in range(-5, 5):
            pool.allocate(2**i + offs)

# }}}


# {{{ test_allocator

@pytest.mark.parametrize("allocator_cls", [ImmediateAllocator, DeferredAllocator])
def test_allocator(ctx_factory, allocator_cls):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    if allocator_cls is DeferredAllocator:
        allocator = allocator_cls(context)
    else:
        allocator = allocator_cls(queue)

    mem = allocator(15)
    mem2 = allocator(0)

    assert mem is not None
    assert mem2 is None

# }}}


# {{{ test_vector_args

def test_vector_args(ctx_factory):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    prg = cl.Program(context, """
        __kernel void set_vec(float4 x, __global float4 *dest)
        { dest[get_global_id(0)] = x; }
        """).build()

    x = cltypes.make_float4(1, 2, 3, 4)
    dest = np.empty(50000, cltypes.float4)
    mf = cl.mem_flags
    dest_buf = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=dest)

    prg.set_vec(queue, dest.shape, None, x, dest_buf)

    cl.enqueue_copy(queue, dest, dest_buf).wait()

    assert (dest == x).all()

# }}}


# {{{ test_header_dep_handling

def test_header_dep_handling(ctx_factory):
    context = ctx_factory()

    from os.path import dirname, exists, join
    assert exists(join(dirname(__file__), "empty-header.h"))

    kernel_src = """
    #include <empty-header.h>
    kernel void zonk(global int *a)
    {
      *a = 5;
    }
    """

    cl.Program(context, kernel_src).build(["-I", dirname(__file__)])
    cl.Program(context, kernel_src).build(["-I", dirname(__file__)])

# }}}


# {{{ test_context_dep_memoize

def test_context_dep_memoize(ctx_factory):
    context = ctx_factory()

    from pyopencl.tools import context_dependent_memoize

    counter = [0]

    @context_dependent_memoize
    def do_something(ctx):
        counter[0] += 1

    do_something(context)
    do_something(context)

    assert counter[0] == 1

# }}}
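
# {{{ example: memory pool as array allocator (editor's addition)

# A brief sketch of how the memory pool exercised above is typically used in
# application code: as the allocator for pyopencl.array containers, so that
# repeated temporaries reuse pooled memory instead of hitting clCreateBuffer
# each time. Passing the pool itself as the allocator follows the PyOpenCL
# documentation; the sizes here are illustrative.

def _example_mempool_allocator(ctx):
    from pyopencl.tools import ImmediateAllocator, MemoryPool
    queue = cl.CommandQueue(ctx)
    pool = MemoryPool(ImmediateAllocator(queue))
    for _ in range(10):
        # each iteration's temporaries come from (and return to) the pool
        a = cl_array.zeros(queue, 10000, np.float32, allocator=pool)
        (a + 1).get()

# }}}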
cl.Program(ctx, """ __kernel void simple(__global float *in, __global float *out) { out[get_global_id(0)] = in[get_global_id(0)]; }""") program.build() binary = program.get_info(cl.program_info.BINARIES)[0] foo = cl.Program(ctx, [device], [binary]) foo.build() n = 256 a_dev = cl.clrandom.rand(queue, n, np.float32) dest_dev = cl_array.empty_like(a_dev) foo.simple(queue, (n,), (16,), a_dev.data, dest_dev.data) # }}} # {{{ test_enqueue_barrier_marker def test_enqueue_barrier_marker(ctx_factory): ctx = ctx_factory() # Still relevant on PoCL 1.0RC1. _xfail_if_pocl( ctx.devices[0].platform, (1, 0), "PoCL crashes on enqueue_barrier") queue = cl.CommandQueue(ctx) if queue._get_cl_version() >= (1, 2) and cl.get_cl_header_version() <= (1, 1): pytest.skip("CL impl version >= 1.2, header version <= 1.1--cannot be sure " "that clEnqueueWaitForEvents is implemented") cl.enqueue_barrier(queue) evt1 = cl.enqueue_marker(queue) evt2 = cl.enqueue_marker(queue, wait_for=[evt1]) cl.enqueue_barrier(queue, wait_for=[evt1, evt2]) # }}} # {{{ test_wait_for_events def test_wait_for_events(ctx_factory): ctx = ctx_factory() queue = cl.CommandQueue(ctx) evt1 = cl.enqueue_marker(queue) evt2 = cl.enqueue_marker(queue) cl.wait_for_events([evt1, evt2]) # }}} # {{{ test_unload_compiler def test_unload_compiler(platform): if (platform._get_cl_version() < (1, 2) or cl.get_cl_header_version() < (1, 2)): from pytest import skip skip("clUnloadPlatformCompiler is only available in OpenCL 1.2") _xfail_if_pocl(platform, (0, 13), "PoCL does not support unloading compiler") if platform.vendor == "Intel(R) Corporation": from pytest import skip skip("Intel proprietary driver does not support unloading compiler") cl.unload_platform_compiler(platform) # }}} # {{{ test_platform_get_devices def test_platform_get_devices(ctx_factory): ctx = ctx_factory() platform = ctx.devices[0].platform if platform.name == "Apple": pytest.xfail("Apple doesn't understand all the values we pass " "for dev_type") dev_types = [cl.device_type.ACCELERATOR, cl.device_type.ALL, cl.device_type.CPU, cl.device_type.DEFAULT, cl.device_type.GPU] if (platform._get_cl_version() >= (1, 2) and cl.get_cl_header_version() >= (1, 2) and not platform.name.lower().startswith("nvidia")): dev_types.append(cl.device_type.CUSTOM) for dev_type in dev_types: print(dev_type) devs = platform.get_devices(dev_type) if dev_type in (cl.device_type.DEFAULT, cl.device_type.ALL, getattr(cl.device_type, "CUSTOM", None)): continue for dev in devs: assert dev.type & dev_type == dev_type # }}} # {{{ test_user_event def test_user_event(ctx_factory): ctx = ctx_factory() if (ctx._get_cl_version() < (1, 1) and cl.get_cl_header_version() < (1, 1)): from pytest import skip skip("UserEvent is only available in OpenCL 1.1") # https://github.com/pocl/pocl/issues/201 _xfail_if_pocl(ctx.devices[0].platform, (0, 13), "PoCL's user events don't work right") status = {} def event_waiter1(e, key): e.wait() status[key] = True def event_waiter2(e, key): cl.wait_for_events([e]) status[key] = True from threading import Thread from time import sleep evt = cl.UserEvent(ctx) Thread(target=event_waiter1, args=(evt, 1)).start() sleep(.05) if status.get(1): raise RuntimeError("UserEvent triggered before set_status") evt.set_status(cl.command_execution_status.COMPLETE) sleep(.05) if not status.get(1): raise RuntimeError("UserEvent.wait timeout") assert evt.command_execution_status == cl.command_execution_status.COMPLETE evt = cl.UserEvent(ctx) Thread(target=event_waiter2, args=(evt, 2)).start() sleep(.05) if 

# {{{ test_user_event

def test_user_event(ctx_factory):
    ctx = ctx_factory()
    if (ctx._get_cl_version() < (1, 1)
            and cl.get_cl_header_version() < (1, 1)):
        from pytest import skip
        skip("UserEvent is only available in OpenCL 1.1")

    # https://github.com/pocl/pocl/issues/201
    _xfail_if_pocl(ctx.devices[0].platform, (0, 13),
            "PoCL's user events don't work right")

    status = {}

    def event_waiter1(e, key):
        e.wait()
        status[key] = True

    def event_waiter2(e, key):
        cl.wait_for_events([e])
        status[key] = True

    from threading import Thread
    from time import sleep

    evt = cl.UserEvent(ctx)
    Thread(target=event_waiter1, args=(evt, 1)).start()
    sleep(.05)
    if status.get(1):
        raise RuntimeError("UserEvent triggered before set_status")
    evt.set_status(cl.command_execution_status.COMPLETE)
    sleep(.05)
    if not status.get(1):
        raise RuntimeError("UserEvent.wait timeout")
    assert evt.command_execution_status == cl.command_execution_status.COMPLETE

    evt = cl.UserEvent(ctx)
    Thread(target=event_waiter2, args=(evt, 2)).start()
    sleep(.05)
    if status.get(2):
        raise RuntimeError("UserEvent triggered before set_status")
    evt.set_status(cl.command_execution_status.COMPLETE)
    sleep(.05)
    if not status.get(2):
        raise RuntimeError("cl.wait_for_events timeout on UserEvent")
    assert evt.command_execution_status == cl.command_execution_status.COMPLETE

# }}}


# {{{ test_buffer_get_host_array

def test_buffer_get_host_array(ctx_factory):
    if cl._PYPY:
        # FIXME
        pytest.xfail("Buffer.get_host_array not yet working on pypy")

    ctx = ctx_factory()
    mf = cl.mem_flags

    rng = np.random.default_rng(seed=42)

    host_buf = rng.random(25, dtype=np.float32)
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.USE_HOST_PTR, hostbuf=host_buf)
    host_buf2 = buf.get_host_array(25, np.float32)
    assert (host_buf == host_buf2).all()
    assert (host_buf.__array_interface__["data"][0]
            == host_buf2.__array_interface__["data"][0])
    assert host_buf2.base is buf

    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=100)
    try:
        host_buf2 = buf.get_host_array(25, np.float32)
        raise AssertionError("MemoryObject.get_host_array should not accept "
                "buffer without USE_HOST_PTR")
    except cl.LogicError:
        pass

    host_buf = rng.random(25, dtype=np.float32)
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host_buf)
    try:
        host_buf2 = buf.get_host_array(25, np.float32)
        raise AssertionError("MemoryObject.get_host_array should not accept "
                "buffer without USE_HOST_PTR")
    except cl.LogicError:
        pass

# }}}


# {{{ test_program_valued_get_info

def test_program_valued_get_info(ctx_factory):
    ctx = ctx_factory()

    prg = cl.Program(ctx, """
    __kernel void
    reverse(__global float *out)
    {
        out[get_global_id(0)] *= 2;
    }
    """).build()

    knl = prg.reverse

    assert knl.program == prg
    knl.program.binaries[0]

# }}}
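
# {{{ example: caching program binaries (editor's addition)

# A hedged sketch tying together program_info.BINARIES (used in
# test_can_build_and_run_binary above) and the program-valued get_info just
# tested: compiled binaries can be stashed on disk and rebuilt later to skip
# compilation. The file name is illustrative; handling of stale or
# device-mismatched binaries is omitted.

def _example_cache_binary(ctx, src, path="kernel.bin"):
    device, = ctx.devices
    binary = cl.Program(ctx, src).build().get_info(cl.program_info.BINARIES)[0]
    with open(path, "wb") as f:
        f.write(binary)
    with open(path, "rb") as f:
        return cl.Program(ctx, [device], [f.read()]).build()

# }}}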

# {{{ test_event_set_callback

def test_event_set_callback(ctx_factory):
    import sys
    if sys.platform.startswith("win"):
        pytest.xfail("Event.set_callback not present on Windows")

    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    _xfail_if_pocl_gpu(queue.device, "event callbacks")

    if ctx._get_cl_version() < (1, 1):
        pytest.skip("OpenCL 1.1 or newer required for set_callback")

    rng = np.random.default_rng(seed=42)
    a_np = rng.random(50000, dtype=np.float32)
    b_np = rng.random(50000, dtype=np.float32)

    got_called = []

    def cb(status):
        got_called.append(status)

    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
    b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

    prg = cl.Program(ctx, """
        __kernel void sum(__global const float *a_g, __global const float *b_g,
            __global float *res_g)
        {
          int gid = get_global_id(0);
          res_g[gid] = a_g[gid] + b_g[gid];
        }
        """).build()

    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

    uevt = cl.UserEvent(ctx)

    evt = prg.sum(queue, a_np.shape, None, a_g, b_g, res_g, wait_for=[uevt])

    evt.set_callback(cl.command_execution_status.COMPLETE, cb)

    uevt.set_status(cl.command_execution_status.COMPLETE)

    queue.finish()

    counter = 0

    # yuck
    while not got_called:
        from time import sleep
        sleep(0.01)

        # wait up to five seconds (?!)
        counter += 1
        if counter >= 500:
            break

    assert got_called

# }}}


# {{{ test_global_offset

def test_global_offset(ctx_factory):
    context = ctx_factory()
    queue = cl.CommandQueue(context)

    _xfail_if_pocl_gpu(queue.device, "global offset")

    prg = cl.Program(context, """
        __kernel void mult(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()

    n = 50
    rng = np.random.default_rng(seed=42)
    a = rng.random(n, dtype=np.float32)

    mf = cl.mem_flags
    a_buf = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    step = 10
    for ofs in range(0, n, step):
        prg.mult(queue, (step,), None, a_buf, global_offset=(ofs,))

    a_2 = np.empty_like(a)
    cl.enqueue_copy(queue, a_2, a_buf)

    assert (a_2 == 2*a).all()

# }}}


# {{{ test_sub_buffers

def test_sub_buffers(ctx_factory):
    ctx = ctx_factory()
    if (ctx._get_cl_version() < (1, 1)
            or cl.get_cl_header_version() < (1, 1)):
        from pytest import skip
        skip("sub-buffers are only available in OpenCL 1.1")

    alignment = ctx.devices[0].mem_base_addr_align

    queue = cl.CommandQueue(ctx)

    n = 30000
    rng = np.random.default_rng(seed=42)
    a = (rng.random(n) * 100).astype(np.uint8)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    start = (5000 // alignment) * alignment
    stop = start + 20 * alignment

    a_sub_ref = a[start:stop]

    a_sub = np.empty_like(a_sub_ref)
    cl.enqueue_copy(queue, a_sub, a_buf[start:stop])

    assert np.array_equal(a_sub, a_sub_ref)

# }}}


# {{{ test_spirv

def test_spirv(ctx_factory):
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    if (ctx._get_cl_version() < (2, 1)
            or cl.get_cl_header_version() < (2, 1)):
        pytest.skip("SPIR-V program creation only available "
                "in OpenCL 2.1 and higher")

    if not queue.device.il_version:
        pytest.skip("SPIR-V program creation not supported by device")

    n = 50000

    a_dev = cl.clrandom.rand(queue, n, np.float32)
    b_dev = cl.clrandom.rand(queue, n, np.float32)
    dest_dev = cl_array.empty_like(a_dev)

    from os.path import dirname, join
    spv_filename = join(dirname(__file__),
            "add-vectors-%d.spv" % queue.device.address_bits)

    with open(spv_filename, "rb") as spv_file:
        spv = spv_file.read()

    prg = cl.Program(ctx, spv).build()
    if (not prg.all_kernels()
            and queue.device.platform.name.startswith("AMD Accelerated")):
        pytest.skip("SPIR-V program creation on AMD did not result in any "
                "kernels")

    prg.sum(queue, a_dev.shape, None, a_dev.data, b_dev.data, dest_dev.data)

    assert la.norm((dest_dev - (a_dev+b_dev)).get()) < 1e-7

# }}}
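
# {{{ example: explicit sub-buffer creation (editor's addition)

# test_sub_buffers above slices with a_buf[start:stop]; the underlying call
# is Buffer.get_sub_region(origin, size). A sketch only: as in that test,
# mem_base_addr_align (nominally a bit count) is used directly as a byte
# offset, which over-aligns and is therefore always safe.

def _example_sub_buffer(ctx):
    dev, = ctx.devices
    align = dev.mem_base_addr_align
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=16*align)
    return buf.get_sub_region(align, align)

# }}}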

# {{{ test_coarse_grain_svm

@pytest.mark.parametrize("use_opaque_style", [False, True])
def test_coarse_grain_svm(ctx_factory, use_opaque_style):
    import sys
    is_pypy = "__pypy__" in sys.builtin_module_names

    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    dev = ctx.devices[0]

    from pytest import skip

    from pyopencl.characterize import has_coarse_grain_buffer_svm
    if not has_coarse_grain_buffer_svm(queue.device):
        skip("device does not support coarse-grain SVM")

    if ("AMD" in dev.platform.name
            and dev.type & cl.device_type.CPU):
        pytest.xfail("AMD CPU doesn't do coarse-grain SVM")
    if ("AMD" in dev.platform.name
            and dev.type & cl.device_type.GPU):
        pytest.xfail("AMD GPU crashes on SVM unmap")
    if (dev.platform.vendor == "The pocl project"
            and dev.type & cl.device_type.GPU
            and "k40" in dev.name.lower()):
        pytest.xfail("Crashes on K40s via PoCL-CUDA")

    dtype = np.dtype(np.float32)
    n = 3000

    if use_opaque_style:
        svm_ary = cl.SVMAllocation(ctx, n*dtype.itemsize, alignment=64,
                flags=cl.svm_mem_flags.READ_WRITE)
    else:
        svm_ary = cl.SVM(cl.csvm_empty(ctx, (n,), dtype, alignment=64))
        if not is_pypy:
            # https://bitbucket.org/pypy/numpy/issues/52
            assert isinstance(svm_ary.mem.base, cl.SVMAllocation)

    cl.enqueue_svm_memfill(queue, svm_ary, np.zeros((), dtype))

    with svm_ary.map_rw(queue) as ary:
        if use_opaque_style:
            ary = ary.view(dtype)
        else:
            assert ary is svm_ary.mem

        assert ary.nbytes == n * dtype.itemsize

        ary.fill(17)
        orig_ary = ary.copy()

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a_g)
        {
          a_g[get_global_id(0)] *= 2;
        }
        """).build()

    prg.twice(queue, (n,), None, svm_ary)
    if dev.platform.vendor == "The pocl project" \
            and dev.type & cl.device_type.GPU:
        # clCreateBuffer from SVM doesn't work yet on GPU pocl
        prg.twice(queue, (n,), None, svm_ary)
    else:
        prg.twice(queue, (n,), None, svm_ary.as_buffer(ctx))

    with svm_ary.map_ro(queue) as ary:
        if use_opaque_style:
            ary = ary.view(dtype)
        else:
            assert ary is svm_ary.mem

        assert np.array_equal(orig_ary*4, ary)

    new_ary = np.empty_like(orig_ary)
    new_ary.fill(-1)

    cl.enqueue_copy(queue, new_ary, svm_ary)
    assert np.array_equal(orig_ary*4, new_ary)

    # {{{ https://github.com/inducer/pyopencl/issues/372

    buf_arr = cl.svm_empty(ctx, cl.svm_mem_flags.READ_ONLY, 10, np.int32)
    out_arr = cl.svm_empty(ctx, cl.svm_mem_flags.READ_WRITE, 10, np.int32)

    svm_buf_arr = cl.SVM(buf_arr)
    svm_out_arr = cl.SVM(out_arr)
    with svm_buf_arr.map_rw(queue) as ary:
        ary.fill(17)

    prg_ro = cl.Program(ctx, r"""
        __kernel void twice_ro(__global int *out_g, __global int *in_g)
        {
          out_g[get_global_id(0)] = 2*in_g[get_global_id(0)];
        }
        """).build()

    prg_ro.twice_ro(queue, buf_arr.shape, None, svm_out_arr, svm_buf_arr)

    with svm_out_arr.map_ro(queue) as ary:
        print(ary)

    # }}}

# }}}


# {{{ test_fine_grain_svm

def test_fine_grain_svm(ctx_factory):
    import sys
    is_pypy = "__pypy__" in sys.builtin_module_names

    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    _xfail_if_pocl_gpu(queue.device, "GPU SVM")

    from pytest import skip

    from pyopencl.characterize import has_fine_grain_buffer_svm
    if not has_fine_grain_buffer_svm(queue.device):
        skip("device does not support fine-grain SVM")

    n = 3000
    ary = cl.fsvm_empty(ctx, n, np.float32, alignment=64)

    if not is_pypy:
        # https://bitbucket.org/pypy/numpy/issues/52
        assert isinstance(ary.base, cl.SVMAllocation)

    ary.fill(17)
    orig_ary = ary.copy()

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a_g)
        {
          a_g[get_global_id(0)] *= 2;
        }
        """).build()

    prg.twice(queue, ary.shape, None, cl.SVM(ary))
    queue.finish()

    print(ary)
    assert np.array_equal(orig_ary*2, ary)

# }}}


# {{{ test_map_dtype

@pytest.mark.parametrize("dtype", [
    np.uint,
    cltypes.uint2,  # type: ignore[attr-defined]
    ])
def test_map_dtype(ctx_factory, dtype):
    if cl._PYPY:
        # FIXME
        pytest.xfail("enqueue_map_buffer not yet working on pypy")

    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    dt = np.dtype(dtype)

    b = pyopencl.Buffer(ctx,
                        pyopencl.mem_flags.READ_ONLY,
                        dt.itemsize)
    array, _ev = pyopencl.enqueue_map_buffer(queue, b, pyopencl.map_flags.WRITE,
                                             0, (1,), dt)
    with array.base:
        print(array.dtype)
        assert array.dtype == dt

# }}}
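
# {{{ example: mapping a buffer into host memory (editor's addition)

# A sketch of the mapping pattern test_map_dtype uses, spelled out:
# enqueue_map_buffer returns a numpy view of device memory plus an event,
# and closing the view's base (it is a context manager) unmaps it. Shape
# and dtype are illustrative.

def _example_map_buffer(ctx, queue):
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 16*4)
    ary, _evt = cl.enqueue_map_buffer(
            queue, buf, cl.map_flags.WRITE, 0, (16,), np.float32)
    with ary.base:  # unmaps on exit
        ary.fill(0)

# }}}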

# {{{ test_compile_link

def test_compile_link(ctx_factory):
    ctx = ctx_factory()

    if ctx._get_cl_version() < (1, 2) or cl.get_cl_header_version() < (1, 2):
        pytest.skip("Context and ICD loader must understand CL1.2 for "
                "compile/link")

    platform = ctx.devices[0].platform
    if platform.name == "Apple":
        pytest.skip("Apple doesn't like our compile/link test")

    # as of pocl 5.0
    _xfail_if_pocl_gpu(ctx.devices[0], "compile/link")

    queue = cl.CommandQueue(ctx)
    vsink_prg = cl.Program(ctx, """//CL//
        void value_sink(float x)
        {
        }
        """).compile()
    pi_h__prg = cl.Program(ctx, """//CL//
        inline float get_pi()
        {
            return 3.1415f;
        }
        """).compile()
    main_prg = cl.Program(ctx, """//CL//
        #include "pi.h"

        void value_sink(float x);

        __kernel void experiment()
        {
            value_sink(get_pi() + get_global_id(0));
        }
        """).compile(headers=[("pi.h", pi_h__prg)])
    z = cl.link_program(ctx, [vsink_prg, main_prg], devices=ctx.devices)
    z.experiment(queue, (128**2,), (128,))
    queue.finish()

# }}}


# {{{ test_copy_buffer_rect

def test_copy_buffer_rect(ctx_factory):
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    _xfail_if_pocl_gpu(queue.device, "rectangular copies")

    arr1 = cl_array.zeros(queue, (2, 3), "f")
    arr2 = cl_array.zeros(queue, (4, 5), "f")
    arr1.fill(1)
    cl.enqueue_copy(
            queue, arr2.data, arr1.data,
            src_origin=(0, 0), dst_origin=(1, 1),
            region=arr1.shape[::-1])

# }}}


# {{{ test_threaded_nanny_events

def test_threaded_nanny_events(ctx_factory):
    # https://github.com/inducer/pyopencl/issues/296

    import gc
    import threading

    def create_arrays_thread(n1=10, n2=20):
        ctx = ctx_factory()
        queue = cl.CommandQueue(ctx)
        for _i1 in range(n2):
            for _i in range(n1):
                acl = cl.array.zeros(queue, 10, dtype=np.float32)
                acl.get()
            # Garbage collection triggers the error
            print("collected ", str(gc.collect()))
            print("stats ", gc.get_stats())

    t1 = threading.Thread(target=create_arrays_thread)
    t2 = threading.Thread(target=create_arrays_thread)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

# }}}


# {{{ test_empty_ndrange

@pytest.mark.parametrize("empty_shape", [(0,), (3, 0, 2)])
def test_empty_ndrange(ctx_factory, empty_shape):
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    if ctx._get_cl_version() < (1, 2) or cl.get_cl_header_version() < (1, 2):
        pytest.skip("OpenCL 1.2 required for empty NDRange support")

    a = cl_array.zeros(queue, empty_shape, dtype=np.float32)

    prg = cl.Program(ctx, """
        __kernel void add_two(__global float *a_g)
        {
          a_g[get_global_id(0)] += 2;
        }
        """).build()

    prg.add_two(queue, a.shape, None, a.data, allow_empty_ndrange=True)

# }}}


# {{{ test_command_queue_context_manager

def test_command_queue_context_manager(ctx_factory):
    ctx = ctx_factory()
    with cl.CommandQueue(ctx) as queue:
        q = queue

    with pytest.warns(cl.CommandQueueUsedAfterExit):
        q.flush()

# }}}


# {{{ test_capture_call

def test_capture_call(ctx_factory):
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    rng = np.random.default_rng()
    a_np = rng.random(500, dtype=np.float32)
    b_np = rng.random(500, dtype=np.float32)

    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
    b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

    prg = cl.Program(ctx, """
        __kernel void sum(
            __global const float *a_g, __global const float *b_g,
            __global float *res_g)
        {
          int gid = get_global_id(0);
          res_g[gid] = a_g[gid] + b_g[gid];
        }
        """).build()

    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

    from io import StringIO
    sio = StringIO()
    prg.sum.capture_call(sio, queue, a_np.shape, None, a_g, b_g, res_g)

    compile_dict = {}
    exec(compile(sio.getvalue(), "captured.py", "exec"), compile_dict)
    compile_dict["main"]()

# }}}
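
# {{{ example: timing a kernel via event profiling (editor's addition)

# A sketch, not in the original suite, of event profiling, which
# test_get_info above enables conditionally via PROFILING_ENABLE: an event's
# profile attribute exposes queued/submit/start/end timestamps in
# nanoseconds. Assumes a built program with a one-buffer kernel "mult" as in
# test_global_offset.

def _example_profile_kernel(ctx, prg, a_buf, shape):
    queue = cl.CommandQueue(
            ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    evt = prg.mult(queue, shape, None, a_buf)
    evt.wait()
    print("kernel took %g ns" % (evt.profile.end - evt.profile.start))

# }}}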

# {{{ test_enqueue_copy_array

def test_enqueue_copy_array(ctx_factory):
    # https://github.com/inducer/pyopencl/issues/618
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    if ctx._get_cl_version() < (1, 2) or cl.get_cl_header_version() < (1, 2):
        pytest.skip("requires CL 1.2")

    if not queue.device.image_support:
        pytest.skip("device has no image support")

    image_format = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.FLOAT)
    flags = cl.mem_flags.READ_ONLY
    image = np.ascontiguousarray(np.zeros((128, 128, 4), np.float32))
    image_cl = cl.Image(ctx, flags, image_format,
            shape=(image.shape[1], image.shape[0], 1), is_array=True)

    cl.enqueue_copy(queue, dest=image, src=image_cl, origin=(0, 0, 0),
            region=(image.shape[1], image.shape[0], 1))


def test_enqueue_copy_array_2(ctx_factory):
    # https://github.com/inducer/pyopencl/issues/618
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    if ctx._get_cl_version() < (1, 2) or cl.get_cl_header_version() < (1, 2):
        pytest.skip("requires CL 1.2")

    if not queue.device.image_support:
        pytest.skip("device has no image support")

    image_format = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.FLOAT)
    image = np.ascontiguousarray(np.zeros((128, 128, 4), np.float32))

    image_shape = (image.shape[1], image.shape[0])
    array_shape = (*image_shape, 1)

    cl.Image(ctx, cl.mem_flags.READ_ONLY, image_format, shape=image_shape)

    image_array_cl = cl.Image(ctx, cl.mem_flags.READ_ONLY, image_format,
            shape=array_shape, is_array=True)
    image2_array_cl = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, image_format,
            shape=array_shape, is_array=True)

    buffer_cl = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=image.nbytes)

    cl._cl._enqueue_copy_image(
            queue, src=image_array_cl, dest=image2_array_cl,
            src_origin=(0, 0, 0), dest_origin=(0, 0, 0),
            region=array_shape)

    cl._cl._enqueue_copy_image_to_buffer(
            queue, src=image_array_cl, dest=buffer_cl, offset=0,
            origin=(0, 0, 0), region=array_shape)

# }}}

def test_zero_size_svm_allocations(ctx_factory):
    ctx = ctx_factory()

    from pytest import skip

    from pyopencl.characterize import has_coarse_grain_buffer_svm
    if not has_coarse_grain_buffer_svm(ctx.devices[0]):
        skip("device does not support coarse-grain SVM")

    # Go back to svm_empty once
    # https://github.com/numpy/numpy/issues/26366 is solved.
    # zero_sized_svm = cl.svm_empty(ctx, cl.svm_mem_flags.READ_WRITE,
    #         0, np.float64)
    zero_sized_svm = cl.SVMAllocation(ctx, 0, 0, cl.svm_mem_flags.READ_WRITE)

    zero_sized_svm.release()

    from pyopencl.tools import SVMAllocator, SVMPool
    svm_alloc = SVMAllocator(ctx)
    zero_sized_svm = svm_alloc(0)
    zero_sized_svm.release()

    svm_pool = SVMPool(svm_alloc)
    zero_sized_svm = svm_pool(0)
    zero_sized_svm.release()


def test_buffer_release(ctx_factory):
    ctx = ctx_factory()
    queue = cl.CommandQueue(ctx)

    mem_pool = cl.tools.MemoryPool(cl.tools.ImmediateAllocator(queue))

    b = mem_pool.allocate(1000)
    print(type(b))
    b.release()


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        exec(sys.argv[1])
    else:
        from pytest import main
        main([__file__])

# vim: foldmethod=marker
pyopencl-2025.1/PKG-INFO0000644000000000000000000001125714332717401011467 0ustar00Metadata-Version: 2.1
Name: pyopencl
Version: 2025.1
Summary: Python wrapper for OpenCL
Author-Email: Andreas Kloeckner <inform@tiker.net>
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Physics
Project-URL: Documentation, https://documen.tician.de/pyopencl
Project-URL: Homepage, https://mathema.tician.de/software/pyopencl
Project-URL: Repository, https://github.com/inducer/pyopencl
Requires-Python: ~=3.8
Requires-Dist: importlib-resources; python_version < "3.9"
Requires-Dist: numpy
Requires-Dist: platformdirs>=2.2
Requires-Dist: pytools>=2024.1.5
Requires-Dist: oclgrind-binary-distribution>=18.3; extra == "oclgrind"
Requires-Dist: pocl-binary-distribution>=1.2; extra == "pocl"
Requires-Dist: ruff; extra == "test"
Requires-Dist: mako; extra == "test"
Requires-Dist: mypy; extra == "test"
Requires-Dist: pylint; extra == "test"
Requires-Dist: pytest>=7; extra == "test"
Provides-Extra: oclgrind
Provides-Extra: pocl
Provides-Extra: test
Description-Content-Type: text/x-rst

PyOpenCL: Pythonic Access to OpenCL, with Arrays and Algorithms
===============================================================

.. |badge-gitlab-ci| image:: https://gitlab.tiker.net/inducer/pyopencl/badges/main/pipeline.svg
    :alt: Gitlab Build Status
    :target: https://gitlab.tiker.net/inducer/pyopencl/commits/main
.. |badge-github-ci| image:: https://github.com/inducer/pyopencl/workflows/CI/badge.svg?branch=main&event=push
    :alt: Github Build Status
    :target: https://github.com/inducer/pyopencl/actions?query=branch%3Amain+workflow%3ACI+event%3Apush
.. |badge-pypi| image:: https://badge.fury.io/py/pyopencl.svg
    :alt: Python Package Index Release Page
    :target: https://pypi.org/project/pyopencl/
.. |badge-zenodo| image:: https://zenodo.org/badge/1575307.svg
    :alt: Zenodo DOI for latest release
    :target: https://zenodo.org/badge/latestdoi/1575307

|badge-gitlab-ci| |badge-github-ci| |badge-pypi| |badge-zenodo|

PyOpenCL lets you access GPUs and other massively parallel compute
devices from Python. It tries to offer computing goodness in the
spirit of its sister project `PyCUDA <https://mathema.tician.de/software/pycuda>`__:

* Object cleanup tied to lifetime of objects.
  This idiom, often called
  `RAII <https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization>`__
  in C++, makes it much easier to write correct, leak- and crash-free code.

* Completeness. PyOpenCL puts the full power of OpenCL's API at
  your disposal, if you wish. Every obscure ``get_info()`` query and
  all CL calls are accessible.

* Automatic Error Checking. All CL errors are automatically
  translated into Python exceptions.

* Speed. PyOpenCL's base layer is written in C++, so all the niceties
  above are virtually free.

* Helpful and complete `Documentation <https://documen.tician.de/pyopencl>`__
  as well as a `Wiki <https://wiki.tiker.net/PyOpenCL>`__.

* Liberal license. PyOpenCL is open-source under the
  `MIT license <https://en.wikipedia.org/wiki/MIT_License>`__
  and free for commercial, academic, and private use.

* Broad support. PyOpenCL was tested and works with Apple's, AMD's, and
  Nvidia's CL implementations.

Simple 4-step `install instructions <https://documen.tician.de/pyopencl/misc.html#installation>`__
using Conda on Linux and macOS (that also install a working OpenCL
implementation!) can be found in the
`documentation <https://documen.tician.de/pyopencl/misc.html#installation>`__.

What you'll need if you do *not* want to use the convenient instructions above
and instead build from source:

* g++/clang new enough to be compatible with nanobind (specifically, full
  support of C++17 is needed)
* `numpy <https://numpy.org>`__, and
* an OpenCL implementation. (See this `howto <https://wiki.tiker.net/OpenCLHowTo>`__
  for how to get one.)

Links
-----

* `Documentation <https://documen.tician.de/pyopencl>`__
  (read how things work)
* `Python package index <https://pypi.org/project/pyopencl>`__
  (download releases, including binary wheels for Linux, macOS, Windows)
* `Conda Forge <https://anaconda.org/conda-forge/pyopencl>`__
  (download binary packages for Linux, macOS, Windows)
* `Github <https://github.com/inducer/pyopencl>`__
  (get latest source code, file bugs)