Suppose you have some kind of list of types. Such a list can by itself be used to perform any compile-time computation one might come up with. So let us suppose that you additionally want to construct a tuple from something that is based on this list, i.e. you want to connect the compile-time-only type list to a run-time object. In such a case you might run into new questions such as: How do I call constructors for each of my tuple values? How do I offer access to the tuple values using only the type as a reference? How do I call a function for each value in the tuple while preserving the connection to the compile-time list? If such questions are of interest to you, this article might be as well.

While the standard’s tuple template is part of the C++ subset I use in basically all of my developments^{1}, I recently had to revisit some of these questions while reworking OpenLB’s core data structure using its *meta descriptor* concept. The starting point for this was a class template called `FieldArrayD` that stores an array of instances of a single field in a SIMD-vectorization-friendly *structure of arrays* layout. As an LBM lattice in practice stores not just one such field type but multiple of them (all declared in the central *descriptor* structure), I then wanted a `MultiFieldArrayD` class template that does just that: a simple wrapper that accepts a list of fields as a variadic template parameter pack and instantiates a `FieldArrayD` for each of them. A sensible place for storing these instances is of course our trusty `std::tuple`:

```cpp
/// SoA storage for instances of a single FIELD
template<typename T, typename DESCRIPTOR, typename FIELD>
struct FieldArrayD : public ColumnVector<T,DESCRIPTOR::template size<FIELD>()> {
  FieldArrayD(std::size_t count):
    ColumnVector<T,DESCRIPTOR::template size<FIELD>()>(count) { }
  /* [...] */
};

template<typename T, typename DESCRIPTOR, typename... FIELDS>
class MultiFieldArrayD {
private:
  std::tuple<FieldArrayD<T,DESCRIPTOR,FIELDS>...> _data;
  /* [...] */
```

A constructor for such a `MultiFieldArrayD` class should now pass the same element count to each element constructor of the `_data` tuple. This is more difficult than forwarding an individual value to each element, which could be done using a common perfect forwarding pattern. But after some playing around I came up with a constructor

```cpp
MultiFieldArrayD(std::size_t count):
  _count(count),
  // Trickery to construct each member of _data with `count`.
  // Uses the comma operator in conjunction with type dropping.
  _data((utilities::meta::void_t<FIELDS>(), count)...)
{ }
```

that does what I want in a much more compact fashion than I expected at the beginning. Let’s unwrap this: `utilities::meta::void_t` is a placeholder implementation of C++17’s `std::void_t` that I use until we upgrade our C++14 code base^{2} to something more recent. In this case this somewhat aids the exposition, as we can easily take a look at its definition:

```cpp
template <typename...>
using void_t = void;
```

If we consider this template to be a function, it simply swallows any arguments it is given and returns `void`. What we want to achieve is to duplicate the `count` parameter `sizeof...(FIELDS)` times and pass this parameter pack to the tuple’s perfect forwarding constructor. Such a pack is easily generated using the variadic expansion operator `...`. Sadly, for this to work we need some kind of type-level dependency on the types in our pack, which we do not really have when duplicating the count value (ignoring the number of times we want to duplicate it). One crafty way of getting a dependency anyway is to use the not very well known comma operator.

The comma operator forms a binary expression `a, b` that evaluates both `a` and `b` but returns only `b`. That is, the expression `(void_t<FIELDS>(), count)` depends on the types in the list `FIELDS` but swallows them without using them, in favour of returning `count`. All in all this means that `(void_t<FIELDS>(), count)...` will evaluate to a list of `sizeof...(FIELDS)` copies of `count` that are then passed as arguments to the tuple constructor. Note that if the field types are constructible we could also write e.g. `(FIELDS(), count)...`, but this doesn’t work for my use case as I do not want my description-only field types to be runtime-instantiable.

The next thing we might want to do after successfully constructing a `MultiFieldArrayD` is to access an individual `FieldArrayD` instance. If we know the index of the desired field in the variadic list this is easily done using a plain call to `std::get`. In practice I find that `fields.get<FORCE>()` both looks nicer than e.g. `fields.get<1>()` and is also self-documenting, which is always desirable. To do this we use the implicit assumption that types are not duplicated in our list and provide a recursive constexpr function to calculate the index:

```cpp
template <
  typename WANTED_FIELD,
  typename CURRENT_FIELD,
  typename... FIELDS,
  // WANTED_FIELD equals the head of our field list, terminate recursion
  std::enable_if_t<std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexInFieldList()
{
  return 0;
}

template <
  typename WANTED_FIELD,
  typename CURRENT_FIELD,
  typename... FIELDS,
  // WANTED_FIELD doesn't equal the head of our field list
  std::enable_if_t<!std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexInFieldList()
{
  // Break compilation when WANTED_FIELD is not provided by list of fields
  static_assert(sizeof...(FIELDS) > 0, "Field not found.");
  return 1 + getIndexInFieldList<WANTED_FIELD,FIELDS...>();
}
```

This could probably be written more compactly using e.g. a `std::conditional_t` alias template, but this way we get a sensible assertion error when the field is not available. Furthermore, as this function is also required in other areas of the field concept^{3}, the actual call in `MultiFieldArrayD` reads rather well:

```cpp
template <typename FIELD>
FieldArrayD<T,DESCRIPTOR,FIELD>& get()
{
  return std::get<descriptors::getIndexInFieldList<FIELD,FIELDS...>()>(_data);
}
```

The concept of swallowing during variadic pack expansion can also be utilized to call a lambda expression for each value of the tuple. This is useful as a building block for writing e.g. initialization or data serialization code that commonly needs to iterate over all fields. For example, consider an extract of a copy assignment operator for a facade class representing a single cell of a lattice:

```cpp
template <typename T, typename DESCRIPTOR>
Cell<T,DESCRIPTOR>& Cell<T,DESCRIPTOR>::operator=(ConstCell<T,DESCRIPTOR>& rhs)
{
  /* [...] */
  this->_staticFieldsD.forFieldsAt(this->_iCell, [&rhs](auto field, auto id) {
    field = rhs.getFieldPointer(id);
  });
  /* [...] */
```

Or a code snippet to serialize all field data to a sequential buffer:

```cpp
T* currData = data + DESCRIPTOR::template size<descriptors::POPULATION>();
this->_staticFieldsD.forFieldsAt(this->_iCell, [&currData](auto field, auto id) {
  for (unsigned iDim=0; iDim < decltype(field)::d; ++iDim) {
    *(currData++) = field[iDim];
  }
});
```

The common element of these examples is of course the call to `forFieldsAt`, which is a template method of `MultiFieldArrayD`. As its structure suggests, the generic lambda expression is called for each field instance that belongs to the index `_iCell`. The `field` argument is an instance of some structure that provides access to the correct row of the `FieldArrayD` instance belonging to the current field, and `id` is an identifier that can be used to connect this back to the actual field type (as the `field` argument is a generic vector type that only carries the size of the row and not the field name).

```cpp
template <typename F>
void forFieldsAt(std::size_t idx, F f)
{
  utilities::meta::swallow(
    (f(get<FIELDS>().getFieldPointer(idx), utilities::meta::id<FIELDS>{}), 0)...
  );
}
```

As we can see, the expectations towards such a `forFieldsAt` function are surprisingly easy to fulfill by using the *swallow pattern*. The `utilities::meta::swallow` function is needed here as variadic pack expansion in some sense needs a place to expand into. In our previous example this was the tuple constructor, but as we do not need to construct anything here, `swallow` fills the same niche.

```cpp
/// Function equivalent of void_t, swallows any argument
template <typename... ARGS>
void swallow(ARGS&&...) { }
```

A closer look at the expanded comma operator expression shows that the function argument `f` is passed two arguments and the void result is dropped in favour of returning and subsequently swallowing zero. The first argument is the reference to the requested row of our SoA storage and the second argument is a helper class to work around the non-constructibility of the field type in this specific situation. Note that invoking `f` with different argument types for each field works thanks to C++14’s generic lambda expressions: any `auto` arguments are templatized in the generated function call operator of the lambda’s closure class.

```cpp
template <typename TYPE>
struct id {
  using type = TYPE;
};
```

Using this identity wrapper struct enables us to employ C++’s template argument deduction rules to access the field type without knowing the corresponding template parameter name in our generic lambda.

```cpp
template <typename T, typename DESCRIPTOR>
template <typename FIELD_ID>
VectorPtr<T,DESCRIPTOR::template size<typename FIELD_ID::type>()>
Cell<T,DESCRIPTOR>::getFieldPointer(FIELD_ID id)
{
  return getFieldPointer<typename FIELD_ID::type>();
}
```

In theory, both field type and field value access could be combined in a single argument of the generic lambda expression passed to `forFieldsAt`, but this would require field-specific `VectorPtr` instantiations in my specific situation.

All in all, this article illustrates another step I took in my quest to generate efficient data structures for population and field data from a single high-level type description while preserving self-documentation and static handling of the memory layout, without any need for the user to juggle raw offsets. The specific *swallow pattern* used in this instance is something I feel will come in handy in even more situations in the future. It really is much more compact and readable than any equivalent implementation using e.g. index sequences would be.

1. Also not the first time on this blog, e.g. *mapping arrays using tuples* in 2014 or *mapping binary structures as tuples* in 2013.↩︎
2. Not done yet as we need to support various older compilers and HPC environments, e.g. Intel’s compiler tends to be problematic in this context but yields significant performance gains for large simulations.↩︎
3. See *expressive meta templates for flexible handling of compile-time constants* for further examples.↩︎

To both not leave the 2010s behind with just one measly article in their last year and to showcase some of the stuff I am currently working on this article covers a bouquet of topics – spanning both math-heavy theory and practical software development as well as travels to new continents. As to retroactively befit the title this past year of mine was dominated by various topics in the field of Lattice Boltzmann Methods. CFD in general and LBM in particular have shaped to become the common denominator of my studies, my work and even my leisure time.

The year began with the successful conclusion of my undergraduate studies of Mathematics at KIT. My corresponding Bachelor thesis discusses *Grid refined Lattice Boltzmann Methods in OpenLB*, in particular the approach taken by Lagrava et al. in *Advances in Multi-domain Lattice Boltzmann Grid Refinement*. The goal of such developments is to port one of the advantages of more classical approaches to fluid dynamics, namely Finite Element or Finite Volume methods, into the world of LBM: the ability to straightforwardly fit the discretizing mesh to the problem at hand. This feature is intrinsic to FEM as all computations are mapped from a physically embedded mesh of e.g. triangles into reference elements. The embedded mesh may be *easily* adapted to e.g. be more fine-grained at boundaries or in other areas where the modeled fluid structures are more involved.

Doing this for the regular grids employed by LB implementations is more difficult in the sense that there is no intrinsic way to convert between differently resolved grids. What is more, it is not desirable to remove too much of the lattice structure’s regularity, as this regularity is one of the main aspects supporting the performance advantage, which in turn is one of the method’s main selling points. On the theoretical side the main question is how to convert the population values at the contact surface between two differently resolved grids. Coming from a high-resolution grid one has to decide how to restrict the more detailed information to a lower resolution, and coming from a low-resolution grid one has to find a way to recover the missing information compared to the targeted higher resolution. These questions are reflected directly in Lagrava’s approach by distinguishing between a restriction and an interpolation of the populations’ non-equilibrium part.

The practical impact of my work during this thesis on OpenLB is a prototype implementation of grid refinement in 2D. In due time this will be expanded into a universally usable implementation for both two and three spatial dimensions but adding support for GPU-based computations to OpenLB currently enjoys a higher priority – but more on that later.

As one of the seminars required for my Master degree I studied how symbolic optimization, specifically common subexpression elimination, can help to automatically generate high-performing LB implementations. To fit the overarching goal of my work, the chosen target architecture for this was GPGPUs such as Nvidia’s P100.

As is detailed in the corresponding report I was pleasantly surprised by the performance resulting from code generated by formulating the LB collision step in the SymPy CAS library and applying the offered CSE optimization.

| CSE | D2Q9 (single) | D2Q9 (double) | D3Q19 (single) | D3Q19 (double) | D3Q27 (single) | D3Q27 (double) |
|-----|---------------|---------------|----------------|----------------|----------------|----------------|
| No  | 96.1% | 75.7% | 73.2% | 55.9% | 63.0% | 51.3% |
| Yes | 95.6% | 96.4% | 96.9% | 98.7% | 94.9% | 99.8% |

Just as an example the table above lists the achieved performance on a P100 compared to the theoretical maximum on this platform before and after eliminating common subexpressions. The newer the hardware I test this on is, the less hand-optimization of the kernel code seems to matter. This nicely mirrors the historic development of CPUs where the hardware got better and better at efficiently executing code that is not optimized for a specific target CPU.

One of my current main interests is to expand on these results to develop a general framework for automatic Lattice Boltzmann kernel generation. The boltzgen library marks my first steps in this direction and is also my first serious use case for the Python ecosystem. Whereas I was originally not very fond of Python as a language – the switch from Python 2 to 3 and the surrounding issues as well as the syntax shaped my opinions there – the development speed and ease of expression kind of won me over during the course of this year. If one is mainly plugging together existing frameworks and delegating work to the GPU the resulting code tends to be more pleasant than a comparable development in e.g. C++.

Most of my working hours as a student employee of KIT’s Lattice Boltzmann Research Group were spent on two far reaching new developments: Implementing a template based framework for managing the memory of the various data fields required for LBM simulations and rewriting the essential Cell data structure into a pure data view. Details of the former are available in my article on *Expressive meta templates for flexible handling of compile-time constants*. The latter project lays the groundwork for my implementation of the *Shift-Swap-Streaming* propagation pattern that will be included in the next OpenLB release. This switch from the old collision-centric propagation pattern detailed by Mattila et al. in *An Efficient Swap Algorithm for the Lattice Boltzmann Method* to a new GPU- and vectorization-friendly algorithm is an important milestone in our ongoing quest to implement GPU-support in OpenLB. SSS is a very nice reformulation of the established single-grid A-A pattern into a plain collision step followed by changes to memory pointers in a central control structure. This means that streaming of information between neighboring lattice cells is not performed by explicitly moving memory around but rather by cunningly swapping and shifting some pointers. As an illustration:

Further details of this approach developed by Mohrhard et al. – in the same research group that I am currently working in – are available in *An Auto-Vectorization Friendly Parallel Lattice Boltzmann Streaming Scheme for Direct Addressing*.

At the time that I am writing this article I’ve only been back in Germany for about two weeks as I had the great opportunity to spend three weeks in Brazil at the University of Rio Grande do Sul. There I amongst other things held a talk on the *Efficient parallel implementation* of Lattice Boltzmann Methods – of which the slides in the previous section are an extract – as part of a workshop jointly organized by LBRG and SBCB.

I very much enjoyed my time in Porto Alegre and had the chance to discover Brazil as a country that I’d really like to spend more time travelling in – just look at some of the views we had during a weekend trip to Torres…

…and the Itaimbezinho canyon near Cambara do Sul:

After I ended up with a quite well performing GPU LBM code as a result of my seminar talk on symbolic code optimization, I chose to expend some effort on developing nice-looking real-time visualizations. Some of them are collected on my YouTube channel as well as linked behind the images in this section.

The quest to visualize three dimensional fluid flow led me into the field of computer graphics, specifically ray marching and signed distance functions. The former is useful when one considers the velocity field resulting from a simulation as a participating media through which light is shining while the latter may be used for describing, displaying and even voxelizing obstacle geometries.

For now the sources for these and other simulations still reside in a playground repository, but one of my goals for the upcoming year is to further develop my own LB code based on the framework described in a previous section of this article. In addition, I also prototyped SDF-based indicator functions for OpenLB during my stay in Brazil and some form of support for this will be included in the upcoming release. Constructive solid geometry based on such functions offers a very flexible and information-rich concept for constructing simulation models, e.g. outer normals for certain boundary conditions are easily extracted from such a description.

As an example consider the full code of the grid fin geometry visualized above:

```glsl
float sdf(vec3 v) {
  v = rotate_z(translate(v, v3(center.x/2, center.y, center.z)), -0.6);
  const float width = 1;
  const float angle = 0.64;
  return add(
    sadd(
      sub(
        rounded(box(v, v3(5, 28, 38)), 1),
        rounded(box(v, v3(6, 26, 36)), 1)
      ),
      cylinder(translate(v, v3(0,0,-45)), 5, 12),
      1
    ),
    sintersect(
      box(v, v3(5, 28, 38)),
      add(
        add(
          box(rotate_x(v, angle), v3(10, width, 100)),
          box(rotate_x(v, -angle), v3(10, width, 100))
        ),
        add(
          add(
            add(
              box(rotate_x(translate(v, v3(0,0,25)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,25)), -angle), v3(10, width, 100))
            ),
            add(
              box(rotate_x(translate(v, v3(0,0,-25)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,-25)), -angle), v3(10, width, 100))
            )
          ),
          add(
            add(
              box(rotate_x(translate(v, v3(0,0,50)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,50)), -angle), v3(10, width, 100))
            ),
            add(
              box(rotate_x(translate(v, v3(0,0,-50)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,-50)), -angle), v3(10, width, 100))
            )
          )
        )
      ),
      2
    )
  );
}
```

This quickly thrown together prototype is already somewhat reminiscent of how geometries are described by CSG-based CAD software packages such as OpenSCAD. As I have just started out working on this, I expect lots of further fun with it – and everything else detailed in this article – for the upcoming year.

So we recently released a new version of OpenLB which includes a major refactoring of the central data structure used to handle the various kinds of compile-time constants required by a simulation. This article summarizes the motivation and design of this new concept as well as highlighting a couple of tricks and pitfalls in the context of template metaprogramming.

Every simulation based on Lattice Boltzmann Methods can be characterized by a set of constants such as the modelled spatial dimension, the number of neighbors in the underlying regular grid, the weights used to compute equilibrium distributions or the lattice speed of sound. Due to OpenLB’s goal of offering a wide variety of LB models to address many different kinds of flow problems, the constants are not hardcoded throughout the codebase but rather maintained in compile-time data structures. Any usage of these constants can then refer to the characterizing descriptor data structure.

```cpp
/// Old equilibrium implementation using descriptor data
static T equilibrium(int iPop, T rho, const T u[DESCRIPTOR::d], const T uSqr)
{
  T c_u = T();
  for (int iD=0; iD < DESCRIPTOR::d; ++iD) {
    c_u += DESCRIPTOR::c[iPop][iD]*u[iD];
  }
  return rho * DESCRIPTOR::t[iPop] * (
    (T)1 + DESCRIPTOR::invCs2 * c_u
         + DESCRIPTOR::invCs2 * DESCRIPTOR::invCs2 * (T)0.5 * c_u * c_u
         - DESCRIPTOR::invCs2 * (T)0.5 * uSqr
  ) - DESCRIPTOR::t[iPop];
}
```

As many parts of the code do not actually care which specific descriptor is used, most classes and functions are templates that accept any user-defined descriptor type. This allows us to e.g. select descriptor specific optimizations^{1} via plain template specializations.

To continue, the descriptor concept is tightly coupled to the definition of the cells that make up the simulation lattice. The reason for this connection is that we require some place to store the essential per-direction population fields for each node of the lattice. In OpenLB this place is currently the `Cell` class^{2}, which locally maintains the population data and as such implements a collision-optimized *array of structures* memory layout. As a side note, this was the initial motivation for rethinking the descriptor concept, as we require more flexible structures to turn this into a more efficient *structure of arrays* situation^{3}.

To better appreciate the new concept we should probably first take a closer look at how this was implemented previously. As a starting point, all descriptors were derived from a descriptor base type such as `D2Q9DescriptorBase` for two-dimensional lattices with nine discrete velocities:

```cpp
template <typename T>
struct D2Q9DescriptorBase {
  typedef D2Q9DescriptorBase<T> BaseDescriptor;
  enum { d = 2, q = 9 };         ///< number of dimensions/distr. functions
  static const int vicinity;     ///< size of neighborhood
  static const int c[q][d];      ///< lattice directions
  static const int opposite[q];  ///< opposite entry
  static const T t[q];           ///< lattice weights
  static const T invCs2;         ///< inverse square of speed of sound
};
```

As we can see this is a plain struct template with some static member constants to store the data. This in itself is not problematic and worked just fine since the project’s inception. Note that the template allows for specification of the floating point type used for all non-integer data. This is required to e.g. use automatic differentiation types that allow for taking the derivative of the whole simulation in order to apply optimization techniques.

```cpp
template<typename T>
const int D2Q9DescriptorBase<T>::vicinity = 1;

template<typename T>
const int D2Q9DescriptorBase<T>::c
  [D2Q9DescriptorBase<T>::q][D2Q9DescriptorBase<T>::d] = {
  { 0, 0},
  {-1, 1}, {-1, 0}, {-1,-1}, { 0,-1},
  { 1,-1}, { 1, 0}, { 1, 1}, { 0, 1}
};

template<typename T>
const int D2Q9DescriptorBase<T>::opposite[D2Q9DescriptorBase<T>::q] = {
  0, 5, 6, 7, 8, 1, 2, 3, 4
};

template<typename T>
const T D2Q9DescriptorBase<T>::t[D2Q9DescriptorBase<T>::q] = {
  (T)4/(T)9,
  (T)1/(T)36, (T)1/(T)9, (T)1/(T)36, (T)1/(T)9,
  (T)1/(T)36, (T)1/(T)9, (T)1/(T)36, (T)1/(T)9
};

template<typename T>
const T D2Q9DescriptorBase<T>::invCs2 = (T)3;
```

The actual data was stored in a separate header `src/dynamics/latticeDescriptors.hh`. All in all this very straightforward approach worked as expected and, as far as the descriptor concept is concerned, could be fully resolved at compile time to avoid unnecessary run-time jumps inside critical code sections. The real issue starts when we take a look at the so-called *external fields*:

```cpp
struct Force2dDescriptor {
  static const int numScalars    = 2;
  static const int numSpecies    = 1;
  static const int forceBeginsAt = 0;
  static const int sizeOfForce   = 2;
};

struct Force2dDescriptorBase {
  typedef Force2dDescriptor ExternalField;
};

template <typename T>
struct ForcedD2Q9Descriptor
  : public D2Q9DescriptorBase<T>,
    public Force2dDescriptorBase { };
```

Some LBM models require additional per-cell data such as external force vectors or values to model chemical properties. As we can see, the declaration of these *external fields* is another task of the descriptor data structure and *the* one that received the ugliest solution in our original implementation.

```cpp
// Set force vectors in all cells of material number 1
sLattice.defineExternalField(
  superGeometry, 1,
  DESCRIPTOR<T>::ExternalField::forceBeginsAt,
  DESCRIPTOR<T>::ExternalField::sizeOfForce,
  force
);
```

For example, this is a completely unsafe access to raw memory, as `forceBeginsAt` and `sizeOfForce` define arbitrary memory offsets. And while we might not care about security in this context, you can probably imagine the kinds of obscure bugs caused by potentially faulty and inconsistent handling of such offsets. To make things worse, the naming of external field indices and size constants was inconsistent between different fields, and things only worked as long as an unclear set of naming and layout conventions was followed.

If you want to risk an even closer look^{4} you can download version 1.2 or earlier and start your dive in `src/dynamics/latticeDescriptors.h`. Otherwise we are going to continue with a description of the new approach.

The initial spark for the development of the new meta descriptor concept was the idea to define external fields as the parametrization of a multilinear function in the foundational `D` and `Q` constants of each descriptor^{5}. Lists of such functions could then be passed around via variadic template argument lists. This allows for handling external fields in a manner that is both flexible and consistent across all descriptors.

Before we delve into the details of how these expectations were implemented, let us first take a look at how the basic `D2Q9` descriptor is defined in the latest OpenLB release:

```cpp
template <typename... FIELDS>
struct D2Q9 : public DESCRIPTOR_BASE<2,9,POPULATION,FIELDS...> {
  typedef D2Q9<FIELDS...> BaseDescriptor;
  D2Q9() = delete;
};

namespace data {

template <>
constexpr int vicinity<2,9> = 1;

template <>
constexpr int c<2,9>[9][2] = {
  { 0, 0},
  {-1, 1}, {-1, 0}, {-1,-1}, { 0,-1},
  { 1,-1}, { 1, 0}, { 1, 1}, { 0, 1}
};

template <>
constexpr int opposite<2,9>[9] = {
  0, 5, 6, 7, 8, 1, 2, 3, 4
};

template <>
constexpr Fraction t<2,9>[9] = {
  {4, 9},
  {1, 36}, {1, 9}, {1, 36}, {1, 9},
  {1, 36}, {1, 9}, {1, 36}, {1, 9}
};

template <>
constexpr Fraction cs2<2,9> = {1, 3};

}
```

These few compact lines^{6} describe the whole structure including all of its data. The various functions to access this data are auto-generated in a generic fashion using template metaprogramming and the previously verbose definition of a forced LB model reduces to a single self-explanatory type alias:

```cpp
using ForcedD2Q9Descriptor = D2Q9<FORCE>;
```

Descriptor data is now exposed via an adaptable set of free functions templated on the descriptor type. This was required to satisfy a secondary goal of decoupling descriptor data definitions and accesses in order to add support for both transparent auto-generation and platform adaptation (i.e. adding workarounds for porting the code to the GPU).

```cpp
/// Refactored generic equilibrium implementation
static T equilibrium(int iPop, T rho, const T u[DESCRIPTOR::d], const T uSqr)
{
  T c_u = T{};
  for (int iD = 0; iD < DESCRIPTOR::d; ++iD) {
    c_u += descriptors::c<DESCRIPTOR>(iPop,iD) * u[iD];
  }
  return rho * descriptors::t<T,DESCRIPTOR>(iPop) * (
    T{1} + descriptors::invCs2<T,DESCRIPTOR>() * c_u
         + descriptors::invCs2<T,DESCRIPTOR>() * descriptors::invCs2<T,DESCRIPTOR>()
           * T{0.5} * c_u * c_u
         - descriptors::invCs2<T,DESCRIPTOR>() * T{0.5} * uSqr
  ) - descriptors::t<T,DESCRIPTOR>(iPop);
}
```

The inclusion of the `descriptors` namespace slightly increases the verbosity of functions such as the one above. If things get too bad we can use local namespace inclusion as a workaround. But even if this were not possible, the transparent extensibility (i.e. the ability to customize the underlying implementation without changing all call sites) more than makes up for increasing the character count of some sections.

Back in 2013 I experimented with *mapping binary structures as tuples using template metaprogramming* in order to develop the foundations for a graph database. Surprisingly there were quite a few parallels between what I was doing then to what I am describing in this article. While I neither used the resulting BinaryMapping library for the development of GraphStorage nor ever used this then LevelDB-based graph *database* for more than a couple of basic examples, it was a welcome surprise to think back to my first steps doing more template-centered C++ programming.

```cpp
/// Base descriptor of a D-dimensional lattice with Q directions and a list of additional fields
template <unsigned D, unsigned Q, typename... FIELDS>
struct DESCRIPTOR_BASE {
  /// Deleted constructor to enforce pure usage as type and prevent implicit narrowing conversions
  DESCRIPTOR_BASE() = delete;

  /// Number of dimensions
  static constexpr int d = D;
  /// Number of velocities
  static constexpr int q = Q;

  /* [...] */
};
```

As the description of any LBM model includes at least a number of spatial dimensions `D` and a number of discrete velocities `Q`, these two constants are the required template arguments of the new `DESCRIPTOR_BASE` class template^{7}. Until we finally get concepts in C++, the members of the `FIELDS` list are by convention expected to offer `size` and `getLocalIndex` template methods accepting these two foundational constants.

```cpp
/// Base of a descriptor field whose size is defined by A*D + B*Q + C
template <unsigned C, unsigned A=0, unsigned B=0>
struct DESCRIPTOR_FIELD_BASE {
  /// Deleted constructor to enforce pure usage as type and prevent implicit narrowing conversions
  DESCRIPTOR_FIELD_BASE() = delete;

  /// Evaluates the size function
  template <unsigned D, unsigned Q>
  static constexpr unsigned size()
  {
    return A * D + B * Q + C;
  }

  /// Returns global index from local index and provides out_of_range safety
  template <unsigned D, unsigned Q>
  static constexpr unsigned getLocalIndex(const unsigned localIndex)
  {
    return localIndex < (A*D+B*Q+C)
      ? localIndex
      : throw std::out_of_range("Index exceeds data field");
  }
};
```

Most^{8} fields use the `DESCRIPTOR_FIELD_BASE` template as a base class. This template parametrizes the previously mentioned multilinear size function and allows for sharing field definitions between all descriptors.

```cpp
// Field types need to be distinct (i.e. not aliases) in order for `DESCRIPTOR_BASE::index` to work
// (Field size parametrized by: Cs + Ds*D + Qs*Q)           Cs Ds Qs
struct POPULATION : public DESCRIPTOR_FIELD_BASE<0, 0, 1> { };
struct FORCE      : public DESCRIPTOR_FIELD_BASE<0, 1, 0> { };
struct SOURCE     : public DESCRIPTOR_FIELD_BASE<1, 0, 0> { };
/* [...] */
```

Let us take the `FORCE` field as an example^{9}: This field represents a cell-local force vector and as such requires exactly `D` floating point values worth of storage. Correspondingly its base class is `DESCRIPTOR_FIELD_BASE<0,1,0>`, which yields a size of `2` for two-dimensional and `3` for three-dimensional descriptors.

Building upon this common field structure allows us to write down a `getIndexFromFieldList` helper function template that automatically calculates the starting offset of any element in an arbitrary list of fields:

```cpp
template <
  unsigned D, unsigned Q,
  typename WANTED_FIELD, typename CURRENT_FIELD, typename... FIELDS,
  // WANTED_FIELD equals the head of our field list, terminate recursion
  std::enable_if_t<std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexFromFieldList()
{
  return 0;
}

template <
  unsigned D, unsigned Q,
  typename WANTED_FIELD, typename CURRENT_FIELD, typename... FIELDS,
  // WANTED_FIELD doesn't equal the head of our field list
  std::enable_if_t<!std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexFromFieldList()
{
  // Break compilation when WANTED_FIELD is not provided by list of fields
  static_assert(sizeof...(FIELDS) > 0, "Field not found.");

  // Add size of current field to implicit offset and continue search
  // for WANTED_FIELD in the tail of our field list
  return CURRENT_FIELD::template size<D,Q>()
       + getIndexFromFieldList<D,Q,WANTED_FIELD,FIELDS...>();
}
```

As far as template metaprogramming is concerned this code is quite basic – we simply traverse the variadic field list recursively and sum up the field sizes along the way. This function is wrapped by the `DESCRIPTOR_BASE::index` method template that exposes the memory offset of a given field. We are left with a generic interface that replaces our previous inconsistent and hard to maintain field offsets in the vein of `DESCRIPTOR::ExternalField::forceBeginsAt`.

```cpp
/// Returns index of WANTED_FIELD
/**
 * Fails compilation if WANTED_FIELD is not contained in FIELDS.
 * Branching that depends on this information can be realized using `provides`.
 **/
template <typename WANTED_FIELD>
static constexpr int index(const unsigned localIndex=0)
{
  return getIndexFromFieldList<D,Q,WANTED_FIELD,FIELDS...>()
       + WANTED_FIELD::template getLocalIndex<D,Q>(localIndex);
}
```

As we will see in the section on *improved field access* this method is not commonly used in user code but rather as a building block for self-documenting field accessors. One might notice that the abstraction layers are starting to pile up – luckily all of them are by themselves rather plain `constexpr` function templates and can as such still be fully collapsed at compile time.

The alert reader might have noticed that the type of the per-direction weight constants `descriptors::data::t` was changed to `Fraction` in our new meta descriptor. The reason for this is that we use variable templates to store these values and C++ sadly doesn’t allow partial specializations in this context. To elaborate, we are not allowed to write:

```cpp
template <typename T>
constexpr Fraction t<T,2,9>[9] = {
  T{4}/T{9},
  T{1}/T{36}, T{1}/T{9}, T{1}/T{36}, T{1}/T{9},
  T{1}/T{36}, T{1}/T{9}, T{1}/T{36}, T{1}/T{9}
};
```

To work around this issue I wrote a small floating-point independent fraction type:

```cpp
class Fraction {
private:
  const int _numerator;
  const int _denominator;

public:
  /* [...] */

  template <typename T>
  constexpr T as() const
  {
    return T(_numerator) / T(_denominator);
  }

  template <typename T>
  constexpr T inverseAs() const
  {
    return _numerator != 0 ? T(_denominator) / T(_numerator)
                           : throw std::invalid_argument("inverse of zero is undefined");
  }
};
```

This works well for both integral and automatically differentiable floating point types and even yields a more pleasant syntax for defining fractional descriptor values due to C++’s implicit constructor calls. One remaining hiccup is the representation of values such as square roots that are not easily expressed as readable rational numbers. Such weights are required by some more exotic LB models and are currently stored by explicit specialization for any required type. A slightly surprising fact in this context is that the C++ standard doesn’t require functions such as `std::sqrt` to be `constexpr`. This problem remained undetected for quite a while as e.g. GCC fixes this issue in a non-standard extension. So in the long term we are going to have to invest some more effort into adding compile-time math functions in the vein of GCEM.

As I hinted previously, one major change besides the refactoring of the actual descriptor structure was the introduction of an abstraction layer between data and call sites. i.e. where we previously wrote `DESCRIPTOR<T>::t[i]` to directly access the i-th weight we now call a free function `descriptors::t<T,DESCRIPTOR>(i)`. The advantage of this additional layer is the ability to transparently switch out the underlying data source. Furthermore we can easily expand such free functions to distinguish between various descriptor specializations at compile time via tagging.

```cpp
template <typename T, unsigned D, unsigned Q>
constexpr T t(unsigned iPop, tag::DEFAULT)
{
  return data::t<D,Q>[iPop].template as<T>();
}

template <typename T, typename DESCRIPTOR>
constexpr T t(unsigned iPop)
{
  return t<T, DESCRIPTOR::d, DESCRIPTOR::q>(iPop, typename DESCRIPTOR::category_tag());
}
```

This powerful concept uses C++’s function overload resolution to transparently call different implementations based on the given template arguments in a very compact fashion. As an example we can mark a descriptor using some non-default tag `tag::SPECIAL` and implement a function `T t(unsigned iPop, tag::SPECIAL)` to do some *special* stuff for this descriptor – the definition of both the tag and its function overload can be written anywhere in the codebase and will be automatically resolved by the generic implementation. This adds a whole new level of extensibility to OpenLB and is currently used to e.g. handle the special requirements of MRT LBM models.

One might have noticed that we accessed a `DESCRIPTOR::category_tag` typedef to select the correct function overload. While the canonical way to do function tagging is to simply define this type on a case by case basis in any tagged structure, I chose to develop something slightly more sophisticated: Tags are represented as special zero-size fields^{10} and passed to the descriptor specialization alongside any other fields. This feels quite nice and results in a very expressive and self-documenting interface for defining new descriptors.

```cpp
/// Base of a descriptor tag
struct DESCRIPTOR_TAG {
  template <unsigned, unsigned>
  static constexpr unsigned size()
  {
    return 0; // a tag doesn't have a size
  }
};
```

As such `DESCRIPTOR_BASE` is the only place where the `category_tag` type is defined. To do this we filter the given list of fields and select the first *tag-field* that is derived from our desired *tag-group* `tag::CATEGORY`.

```cpp
template <typename BASE, typename FALLBACK, typename... FIELDS>
using field_with_base = typename std::conditional<
  std::is_void<typename utilities::meta::list_item_with_base<BASE, FIELDS...>::type>::value,
  FALLBACK,
  typename utilities::meta::list_item_with_base<BASE, FIELDS...>::type
>::type;

/* [...] */

using category_tag = tag::field_with_base<tag::CATEGORY, tag::DEFAULT, FIELDS...>;
```

In order to implement the `utilities::meta::list_item_with_base` meta template I referred back to the *Scheme metaphor for template metaprogramming* which results in a readable filtering operation based on the tools offered by the standard library’s type traits:

```cpp
/// Get first type based on BASE contained in a given type list
/**
 * If no such list item exists, type is void.
 **/
template <
  typename BASE,
  typename HEAD = void, // Default argument in case the list is empty
  typename... TAIL
>
struct list_item_with_base {
  using type = typename std::conditional<
    std::is_base_of<BASE, HEAD>::value,
    HEAD,
    typename list_item_with_base<BASE, TAIL...>::type
  >::type;
};

template <typename BASE, typename HEAD>
struct list_item_with_base<BASE, HEAD> {
  using type = typename std::conditional<
    std::is_base_of<BASE, HEAD>::value,
    HEAD,
    void
  >::type;
};
```

The last remaining cornerstone of OpenLB’s new meta descriptor concept is the introduction of a set of convenient functions to access a cell’s field values via the field’s name. By taking this final step we get the ability to write simulation code that doesn’t handle any raw memory offsets in addition to being more compact. Furthermore we can now in theory completely modify the underlying field storage structures without forcing the user code to change.

```cpp
/// Return pointer to FIELD of cell
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<X::template provides<FIELD>(), T*>
getFieldPointer()
{
  const int offset = DESCRIPTOR::template index<FIELD>();
  return &(this->data[offset]);
}

template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<!X::template provides<FIELD>(), T*>
getFieldPointer()
{
  throw std::invalid_argument("DESCRIPTOR does not provide FIELD.");
  return nullptr;
}
```

The foundation of all field accessors is a new `Cell::getFieldPointer` method template that resolves the field location using the `DESCRIPTOR_BASE::index` and `DESCRIPTOR_BASE::size` functions we defined previously. Note that we had to loosen our newly gained compile-time guarantee of a field’s existence in favour of generating runtime exception code. The reason for this is that most current builds include code that depends on a certain set of fields even if those fields are not actually provided by a given descriptor. While we are going to resolve this unsatisfying situation in the future, this workaround offered an acceptable compromise.

```cpp
/// Set value of FIELD from a vector
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<(X::template size<FIELD>() > 1), void>
setField(const Vector<T,DESCRIPTOR::template size<FIELD>()>& field)
{
  std::copy_n(
    field.data,
    DESCRIPTOR::template size<FIELD>(),
    getFieldPointer<FIELD>());
}

/// Set value of FIELD from a scalar
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<(X::template size<FIELD>() == 1), void>
setField(T value)
{
  getFieldPointer<FIELD>()[0] = value;
}
```

Note that disabling a member function specialization depending on its parent’s template arguments is only possible with some indirection: The parent template argument `DESCRIPTOR` is passed as the default value to the member function’s `X` argument. This parameter can then be used by `std::enable_if` as one would expect.

It is probably clear that the set of changes summarized so far marks a far-reaching revamp of the existing codebase – in fact there was scarcely a file untouched after I got everything to work again. As we do not live in an ideal world where I could have developed this in isolation while all other development was stopped, both the initial prototype and the following rollout to all of OpenLB had to be developed on a separate branch. Due to the additional hindrance that I am not actually working anywhere close to full-time on this^{11} these changes took quite a few months from inception to full realization. Correspondingly the meta descriptor and master branch had diverged significantly by the time we felt ready to merge – you can imagine how unpleasant it was to fiddle this back together.

I found the three-way merge functionality offered by Meld to be a most useful tool during this endeavour. My fingers were still twitching in a rhythmic pattern after two days of using this utility to more or less manually merge everything back together but it was still worlds better than the alternative of e.g. resolving the conflicts in a normal text editor.

Sadly even in retrospect I can not think of a better alternative to letting the branches diverge this far: A significant chunk of all lines had to be changed in randomly non-trivial ways and there was no discrete point in between where you could push these changes to the rest of the team with a good conscience. At least further changes to e.g. the foundational cell data structures should now prove to be significantly easier than they would have been without this refactor.

All in all I am quite satisfied with how this new concept turned out in practice: The code is smaller and more self-documenting while growing in extensibility and consistency. The internally increased complexity is restricted to a set of classes and meta templates that the ordinary user who just wants to write a simulation should never come in contact with. Some listings in this article might look cryptic at first but as far as template metaprogramming goes this is still reasonable – we did not run into any serious portability issues and everything works as expected in GCC, Clang and Intel’s C++ compiler^{12}.

To conclude things I want to encourage everyone to check out the latest OpenLB release to see these and other interesting new features in practice. Should this article have awakened any interest in CFD using Lattice Boltzmann Methods, a fun introduction is provided by my previous article on just this topic.

e.g. collision steps where all generic code is resolved using common subexpression elimination in order to minimize the number of floating point operations↩︎

see `src/core/cell.h` for further reading↩︎

The performance of LBM codes is in general not bound by the available processing power but rather by how well we utilize the available memory bandwidth. i.e. we want to optimize memory throughput as much as possible, which leads us to the need for more efficient streaming steps that in turn require changes to the memory layout.↩︎

Note that this examination of the issues with the previous descriptor concept is not aimed to be a strike at its original developers but is rather an example of how things can get out of hand when expanding an initial concept to cover more and more stuff. As far as legacy code is concerned this is still relatively tame and obviously the niceness of such scaffolding for the actual simulation is a side show when one first and foremost wants to generate new results.↩︎

i.e. each field describes its size as a function $f : \mathbb{N}_0^3 \to \mathbb{N}_0, (a,b,c) \mapsto a + b D + c Q$↩︎

See `src/dynamics/latticeDescriptors.h`↩︎

See `src/dynamics/descriptorBase.h`↩︎

e.g. there is also a `TENSOR` base template that encodes the size of a tensor of order `D` (which is not a linear function)↩︎

Common field definitions are collected in `src/dynamics/descriptorField.h`↩︎

See `src/dynamics/descriptorTag.h`↩︎

↩︎After all I am still primarily a mathematics student↩︎

I was surprised to learn how big of an advantage the Intel compiler can provide: In some settings the generated code runs up to 20 percent faster compared to what GCC or Clang produce.↩︎

As I previously alluded to, computational fluid dynamics is a current subject of interest of mine both academically^{1} and recreationally^{2}. Where on the academic side the focus obviously lies on theoretical strictness and simulations are only useful as far as their error can be judged and bounded, I very much like to take a more hand-wavy approach during my free time and just *fool around*. This works together nicely with my interest in GPU-based computation, which is to be the topic of this article.

While visualizations such as the one above are nice to behold in a purely aesthetic sense independent of any real world groundedness, their implementation is at least inspired by models of our physical reality. The next section aims to give an overview of such models for fluid flows and at least sketch out the theoretical foundation of the specific model implemented on the GPU to generate all visualizations we will see on this page.

The behaviour of weakly compressible fluid flows – i.e. non-supersonic flows where the compressibility of the flowing fluid plays a small but *non-central* role – is commonly modelled by the weakly compressible Navier-Stokes equations which relate density $\rho$, pressure $p$, viscosity $\nu$ and speed $u$ to each other:

$\begin{aligned} \partial_t \rho + \nabla \cdot (\rho u) &= 0 \\ \partial_t u + (u \cdot \nabla) u &= -\frac{1}{\rho} \nabla p + 2\nu\nabla \cdot \left(\frac{1}{2} (\nabla u + (\nabla u)^\top)\right)\end{aligned}$

As such the Navier-Stokes equations model a continuous fluid from a macroscopic perspective. That means that this model doesn’t concern itself with the inner workings of the fluid – e.g. what it is actually made of, how the specific molecules making up the fluid interact individually and so on – but rather considers it as an abstract vector field. One other way to model fluid flows is to explicitly model the individual fluid molecules using classical physics. This microscopic approach closely reflects what actually happens in reality. From this perspective the *flow* of the fluid is just an emergent property of the underlying individual physical interactions. Which approach one chooses for computational fluid dynamics depends on the question one wants to answer as well as the available computational resources. A sufficiently precise model of individual molecular interactions precisely models physical reality in arbitrary situations but is easily much more computationally intensive than a macroscopic approach using Navier-Stokes. In turn, solving such macroscopic equations can quickly become problematic in complex geometries with diverse boundary conditions. No model is perfect and no model is strictly better than any other model in all categories.

The approach I want to introduce for this article is neither macroscopic nor microscopic but situated between those two levels of abstraction – it is a *mesoscopic* approach to fluid dynamics. Such a model is given by the Boltzmann equations that can be used to describe fluids from a statistical perspective. As such the *Boltzmann-approach* is to model neither the macroscopic behavior of a fluid nor the microscopic particle interactions but the probability of a certain mass of fluid particles $f$ moving inside of an external force field $F$ with a certain directed speed $\xi$ at a certain spatial location $x$ at a specific time $t$:

$\left( \partial_t + \xi \cdot \partial_x + \frac{F}{\rho} \cdot \partial_\xi \right) f = \Omega(f) \left( = \partial_x f \cdot \frac{dx}{dt} + \partial_\xi f \cdot \frac{d\xi}{dt} + \partial_t f \right)$

The total differential $\Omega(f)$ of this Boltzmann advection equation can be viewed as a collision operator that describes the local redistribution of particle densities caused by said particles colliding. As this equation by itself is still continuous in all variables we need to discretize it in order to use it on a finite computer. This basically means that we restrict all variable values to a discrete and finite set in addition to replacing difficult to solve parts with more approachable approximations. Implementations of such a discretized Boltzmann equation are commonly referred to as the Lattice Boltzmann Method.

As our goal is to display simple fluid flows on a distinctly two dimensional screen, a first sensible restriction is to limit space to two dimensions^{3}. As a side note: At first glance this might seem strange as no truly 2D fluids exist in our 3D environment. While this doesn’t need to concern us for generating entertaining visuals there are in fact some real world situations where 2D fluid models can be reasonable solutions for 3D problems.

The lattice in LBM hints at the further restriction of our 2D spatial coordinate $x$ to a discrete lattice of points. The canonical way to structure such a lattice is to use a cartesian grid.

Besides the spatial restriction to a two dimensional lattice a common step of discretizing the Boltzmann equation is to approximate the collision operator using an operator pioneered by Bhatnagar, Gross and Krook:

$\Omega(f) := -\frac{f-f^\text{eq}}{\tau}$

This honorifically named BGK operator relaxes the current particle distribution $f$ towards its theoretical equilibrium distribution $f^\text{eq}$ at a rate $\tau$. The value of $\tau$ is one of the main control points for influencing the behaviour of the simulated fluid. e.g. its Reynolds number^{4} and viscosity are controlled using this parameter. Combining this definition of $\Omega(f)$ and the Boltzmann equation without external forces yields the BGK approximation of said equation:

$(\partial_t + \xi \cdot \nabla_x) f = -\frac{1}{\tau} (f(x,\xi,t) - f^\text{eq}(x,\xi,t))$

To further discretize this we restrict the velocity $\xi$ not just to two dimensions but to a finite set of nine discrete unit velocities (`D2Q9` - 2 dimensions, 9 directions):

$\newcommand{\V}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \{\xi_i\}_{i=0}^8 = \left\{ \V{0}{0}, \V{-1}{\phantom{-}1}, \V{-1}{\phantom{-}0}, \V{-1}{-1}, \V{\phantom{-}0}{-1}, \V{\phantom{-}1}{-1}, \V{1}{0}, \V{1}{1}, \V{0}{1} \right\}$

We also define the equilibrium $f^\text{eq}$ towards which all distributions in this model strive as the discrete equilibrium distribution by Maxwell and Boltzmann. This distribution $f_i^\text{eq}$ of the $i$-th discrete velocity $\xi_i$ is given for density $\rho \in \mathbb{R}_{\geq 0}$ and total velocity $u \in \mathbb{R}^2$ as well as fixed lattice weights $w_i$ and lattice speed of sound $c_s$:

$f_i^\text{eq} = w_i \rho \left( 1 + \frac{u \cdot \xi_i}{c_s^2} + \frac{(u \cdot \xi_i)^2}{2c_s^4} - \frac{u \cdot u}{2c_s^2} \right)$

The moments $\rho$ and $u$ at location $x$ are in turn dependent on the cumulated distributions:

$\begin{aligned}\rho(x,t) &= \sum_{i=0}^{q-1} f_i(x,t) \\ \rho u(x,t) &= \sum_{i=0}^{q-1} \xi_i f_i(x,t)\end{aligned}$

Verbosely determining the constant lattice weights and the lattice speed of sound would exceed the scope^{5} of this article. Generally these constants are chosen depending on the used set of discrete velocities in such a way that the resulting collision operator preserves both momentum and mass. Furthermore the operator should be independent of rotations.

$w_0 = \frac{4}{9}, \ w_{2,4,6,8} = \frac{1}{9}, \ w_{1,3,5,7} = \frac{1}{36}, \ c_s = \sqrt{1/3}$

We have now fully discretized the BGK approximation of the Boltzmann equation. As the actual solution to this equation is still implicit in its definition we need to solve the following definite integral of time and space:

$f_i(x+\xi_i, t+1) - f_i(x,t) = -\frac{1}{\tau} \int_0^1 (f_i(x+\xi_i s,t+s) - f_i^\text{eq}(x+\xi_i s, t+s)) ds$

Since the exact integration of this expression is actually non-trivial it is once again only approximated. While there are various ways of going about that, we can get away with using the common trapezoidal rule and the following shift of $f_i$ and $\tau$:

$\begin{aligned}\overline{f_i} &= f_i + \frac{1}{2\tau}(f_i - f_i^\text{eq}) \\ \overline\tau &= \tau + \frac{1}{2}\end{aligned}$

Thus we finally end up with a discrete LBM BGK equation that can be trivially performed – i.e. there is an explicit function for transforming the current state into its successor – on any available finite computer:

$\overline{f_i}(x+\xi_i,t+1) = \overline{f_i}(x,t) - \frac{1}{\overline\tau} (\overline{f_i}(x,t) - f_i^\text{eq}(x,t))$

Note that on an infinite or periodic (e.g. toroidal) lattice this equation defines all distributions in every lattice cell. If we are confronted with more complex situations such as borders where the fluid is reflected or open boundaries where mass enters or leaves the simulation domain we need special boundary conditions to model the missing distributions. Boundary conditions are also one of the big subtopics in LBM theory as there isn’t one condition to rule them all but a plethora of different boundary conditions with their own ups and downsides.

The ubiquitous way of applying the discrete LBM equation to a lattice is to separate it into a two step *Collide-and-Stream* process:

$\begin{aligned}f_i^\text{out}(x,t) &:= f_i(x,t) - \frac{1}{\tau}(f_i(x,t) - f_i^\text{eq}(x,t)) &&\text{(Collide)} \\ f_i(x+\xi_i,t+1) &:= f_i^\text{out}(x,t) &&\text{(Stream)}\end{aligned}$

Closer inspection of this process reveals one of the advantages of LBM driven fluid dynamics: They positively beg for parallelization. While the collision step is embarrassingly parallel due to its fully cell-local nature even the stream step only communicates with the cell’s direct neighbors.

One might note that the values of our actual distributions $f_i$ are – contrary to the stated goal of the previous section – still unrestricted, non-discrete and unbounded real numbers. Their discretization happens implicitly by choosing the floating point type used by our program. In the case of the following compute shaders all these values will be encoded as 4-byte single-precision floating point numbers as is standard for GPU code.

To implement an LBM using compute shaders we need to represent the lattice in the GPU’s memory. Each lattice cell requires nine 4-byte floating point numbers to describe its distribution. This means that in 2D the lattice memory requirement by itself is fairly negligible as e.g. a lattice resolution of `1024x1024` fits within 36 MiB and thus takes up only a small fraction of the onboard memory provided by current GPUs. In fact GPU memory and processors are fast enough that we do not really have to concern ourselves with detailed optimizations^{6} if we only want to visualize a reasonably sized lattice with a reasonable count of lattice updates per second – e.g. 50 updates per second on a `256x256` lattice do not require^{7} any thoughts on optimization whatsoever on the Nvidia K2200 employed by my workstation.

Despite all actual computation happening on the GPU we still need some CPU-based wrapper code to interact with the operating system, initialize memory, control the OpenGL state machine and so on. While I could not find any suitable non-gaming targeted C++ library to ease development of this code the scaffolding originally written^{8} for my vector field visualization computicle was easily adapted to this new application.

To further simplify the implementation of our GLSL stream kernel we can use the abundant GPU memory to store two full states of the lattice. This allows for updating the cell populations of the upcoming collide operation without overwriting the current collision result which in turn means that the execution sequence of the stream kernel doesn’t matter.

So all in all we require three memory regions: A collision buffer for performing the collide step, a streaming buffer as the streaming target and a fluid buffer to store velocity and pressure for visualization purposes. As an example we can take a look at how the underlying lattice buffer for collide and stream is allocated on the GPU:

```cpp
LatticeCellBuffer::LatticeCellBuffer(GLuint nX, GLuint nY) {
  glGenVertexArrays(1, &_array);
  glGenBuffers(1, &_buffer);

  const std::vector<GLfloat> data(9*nX*nY, GLfloat{1./9.});

  glBindVertexArray(_array);
  glBindBuffer(GL_ARRAY_BUFFER, _buffer);
  glBufferData(
    GL_ARRAY_BUFFER,
    data.size() * sizeof(GLfloat),
    data.data(),
    GL_DYNAMIC_DRAW
  );

  glEnableVertexAttribArray(0);
  glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
}
```

We can use the resulting `_buffer` address of type `GLuint` to bind the data array to corresponding binding points inside the compute shader. In our case these binding points are defined as follows:

```glsl
layout (local_size_x = 1, local_size_y = 1) in;

layout (std430, binding=1) buffer bufferCollide { float collideCells[]; };
layout (std430, binding=2) buffer bufferStream  { float streamCells[];  };
layout (std430, binding=3) buffer bufferFluid   { float fluidCells[];   };

uniform uint nX;
uniform uint nY;
```

Calling compute shaders of this signature from the CPU is nicely abstracted by some computicle-derived^{9} wrapper classes such as `ComputeShader`:

```cpp
// vector of buffer addresses to be bound
auto buffers = {
  lattice_a->getBuffer(),
  lattice_b->getBuffer(),
  fluid->getBuffer()
};

// bind buffers for the shaders to work on
collide_shader->workOn(buffers);
stream_shader->workOn(buffers);

// activate and trigger compute shaders
{
  auto guard = collide_shader->use();
  collide_shader->dispatch(nX, nY);
}
{
  auto guard = stream_shader->use();
  stream_shader->dispatch(nX, nY);
}
```

Lattice constants can be stored directly in the shader:

```glsl
const uint q = 9;

const float weight[q] = float[](
  1./36., 1./9., 1./36.,
  1./9. , 4./9., 1./9. ,
  1./36., 1./9., 1./36.
);

const float tau   = 0.8;
const float omega = 1/tau;
```

Manual indexing to mimic multidimensional arrays allows for flexible memory layouts while preserving reasonably easy access:

```glsl
uint indexOfDirection(int i, int j) {
  return 3*(j+1) + (i+1);
}

uint indexOfLatticeCell(uint x, uint y) {
  return q*nX*y + q*x;
}

/* [...] */

float w(int i, int j) {
  return weight[indexOfDirection(i,j)];
}

float get(uint x, uint y, int i, int j) {
  return collideCells[indexOfLatticeCell(x,y) + indexOfDirection(i,j)];
}
```

The discrete equilibrium distribution $f_i^\text{eq}$ is expressed as a single line of code when aided by some convenience functions such as `comp` for the dot product of discrete velocity $\xi_i$ and velocity moment $u$:

```glsl
float equilibrium(float d, vec2 u, int i, int j) {
  return w(i,j)
       * d
       * (1 + 3*comp(i,j,u) + 4.5*sq(comp(i,j,u)) - 1.5*sq(norm(u)));
}
```

Our actual collide kernel `collide.glsl` is compactly expressed as an iteration over all discrete velocities and a direct codification of the collision formula:

```glsl
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

const float d = density(x,y);
const vec2  v = velocity(x,y,d);

setFluid(x,y,v,d);

for ( int i = -1; i <= 1; ++i ) {
  for ( int j = -1; j <= 1; ++j ) {
    set(x,y,i,j,
        get(x,y,i,j) + omega * (equilibrium(d,v,i,j) - get(x,y,i,j)));
  }
}
```

The streaming kernel `stream.glsl` turns out to be equally compact even when a basic bounce back boundary condition is included. Such a condition simply reflects the populations that would be streamed outside the fluid domain to define the – otherwise undefined – populations pointing towards the fluid.

```glsl
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

if ( x != 0 && x != nX-1 && y != 0 && y != nY-1 ) {
  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      set(x+i,y+j,i,j, get(x,y,i,j));
    }
  }
} else {
  // rudimentary bounce back boundary handling
  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      if ( (x > 0 || i >= 0) && x+i <= nX-1 &&
           (y > 0 || j >= 0) && y+j <= nY-1 ) {
        set(x+i,y+j,i,j, get(x,y,i,j));
      } else {
        set(x,y,i*(-1),j*(-1), get(x,y,i,j));
      }
    }
  }
}
```

We can now use the two compute shaders to simulate 2D fluids on the GPU. Sadly we are still missing some way to display the results on our screen so we will not see anything. Luckily all data required to amend this situation already resides on the GPU’s memory within easy reach of video output.

The vertex array containing the fluid’s moments encoded in a 3D vector we wrote to during every collision can be easily passed to a graphic shader:

```cpp
auto guard = scene_shader->use();

// pass projection matrix MVP and lattice dimensions
scene_shader->setUniform("MVP", MVP);
scene_shader->setUniform("nX", nX);
scene_shader->setUniform("nY", nY);

// draw to screen
glClear(GL_COLOR_BUFFER_BIT);
glBindVertexArray(fluid_array);
glDrawArrays(GL_POINTS, 0, _nX*_nY);
```

In this case the graphic shader consists of three stages: A vertex shader to place the implicitly positioned fluid vertices in screen space, a geometry shader to transform point vertices into quads to be colored and a fragment shader to apply the coloring.

```glsl
const vec2 idx = fluidVertexAtIndex(gl_VertexID);

gl_Position = vec4(
  idx.x - nX/2,
  idx.y - nY/2,
  0.,
  1.
);

vs_out.color = mix(
  vec3(-0.5, 0.0, 1.0),
  vec3( 1.0, 0.0, 0.0),
  displayAmplifier * VertexPosition.z * norm(VertexPosition.xy)
);
```

This extract of the first `vertex.glsl` stage reverses the implicit positioning by array index to recover the actual spatial location of the fluid cells and mixes the color scheme for displaying the velocity norm weighted by its density.

```glsl
layout (points) in;
layout (triangle_strip, max_vertices=4) out;

uniform mat4 MVP;

in VS_OUT {
  vec3 color;
} gs_in[];

out vec3 color;

vec4 project(vec4 v) {
  return MVP * v;
}

void emitSquareAt(vec4 position) {
  const float size = 0.5;
  gl_Position = project(position + vec4(-size, -size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4( size, -size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4(-size,  size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4( size,  size, 0.0, 0.0));
  EmitVertex();
}

void main() {
  color = gs_in[0].color;
  emitSquareAt(gl_in[0].gl_Position);
  EndPrimitive();
}
```

`geometry.glsl` projects these fluid cells, which were up until now positioned in lattice space, into the screen's coordinate system via the `MVP` matrix. Such geometry shaders are very flexible as they allow us to easily adapt a fixed point-vertex-based shader interface to different visualization geometries.

This more abstract visualization embedded in its moving glory at the start of this article was generated in the same way by simply spatially shifting the fluid cells by their heavily amplified velocities instead of only coloring them.

As we are displaying a simulated universe for pure entertainment purposes we have *some* leeway in what laws we enforce. So while in practical simulations we would have to carefully handle any external influences to enforce e.g. mass preservation, on our playground nobody prevents us from simply dumping energy into the system at the literal twitch of a finger:

Even though this interactive ~~sand~~fluidbox is as simple as it gets, everyone who has ever played around with falling sand games in the vein of Powder Toy will know how much fun such contained physical models can be. Starting from the LBM code developed in this article it is but a small step to add mouse-based interaction. In fact the most complex part is transforming the on-screen mouse coordinates into lattice space to identify the nodes where density has to be added during collision equilibration. The actual external intervention into our lattice state is trivial:

```glsl
float getExternalPressureInflux(uint x, uint y) {
  if ( mouseState == 1 && norm(vec2(x,y) - mousePos) < 4 ) {
    return 1.5;
  } else {
    return 0.0;
  }
}

/* [...] */

void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;

  const float d = max(getExternalPressureInflux(x,y), density(x,y));
  const vec2  v = velocity(x,y,d);

  setFluid(x,y,v,d);

  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      set( x,y,i,j,
           get(x,y,i,j) + omega * (equilibrium(d,v,i,j) - get(x,y,i,j)) );
    }
  }
}
```
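The screen-to-lattice transform mentioned above boils down to a rescaling plus a flip of the y axis, since window coordinates commonly grow downwards while our lattice rows grow upwards. A minimal CPU-side sketch of this idea - the names, axis convention and clamping are my own assumptions, not the project's actual interface:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mapping a window-space mouse position onto a
// lattice cell: scale to lattice dimensions, flip the y axis and
// clamp to the valid cell range.
struct LatticePos { int x, y; };

LatticePos mouseToLattice(double mouseX, double mouseY,
                          int windowW, int windowH,
                          int nX, int nY) {
  const int x = std::clamp(
    static_cast<int>(mouseX / windowW * nX), 0, nX - 1);
  const int y = std::clamp(
    static_cast<int>((1.0 - mouseY / windowH) * nY), 0, nY - 1);
  return { x, y };
}
```

A mouse position in the center of the window then maps to the center of the lattice, and positions on the window border are clamped to the boundary cells.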

As usual the full project summarized in this article is available on cgit. Lattice Boltzmann Methods are a very interesting approach to modelling fluids on a computer and I hope that the initial theory-heavy section did not completely hide how compact the actual implementation is compared to the generated results. Especially if one doesn’t care for accuracy compared to reality it is very easy to write basic LBM codes and play around in the supremely entertaining field of computational fluid dynamics. Should you be looking for a more serious framework that is actually usable for productive simulations do not hesitate to check out OpenLB, Palabos or waLBerla.

i.e. I’ve now been a student employee of the Lattice Boltzmann Research Group for two years where I contribute to the open source LBM framework OpenLB. Back in 2017 I was granted the opportunity to attend the LBM Spring School in Tunisia. In addition to that I am currently writing my bachelor’s thesis on grid refinement in LBM using OpenLB.↩︎

e.g. boltzbub, compustream, this article.↩︎

Of course the Lattice Boltzmann Method works equally well in three dimensions.↩︎

Dimensionless ratio of inertial to viscous forces, $\mathrm{Re} = \frac{U L}{\nu}$ for characteristic velocity $U$, characteristic length $L$ and kinematic viscosity $\nu$. The Reynolds number is essential for linking the lattice-based simulation to physical models. LBM simulations tend to be harder to control the higher the Reynolds number becomes - i.e. the more *liquid* and thus turbulent the fluid is. For further details see e.g. Chapter 7 *Non-dimensionalisation and Choice of Simulation Parameters* of the book linked right below.↩︎

If you want to know more about all the gritty details I can recommend The Lattice Boltzmann Method: Principles and Practice by Krüger et al.↩︎

e.g. laying out the memory to suit the GPU’s cache structure, optimizing instruction sequence and so on↩︎

i.e. the code runs without causing any mentionable GPU load as reported by the handy nvtop performance monitor↩︎

So I recently acquired a reasonably priced second-hand CAD workstation featuring a Xeon CPU, plenty of RAM and - as the heart of the matter - a nice Nvidia K2200 GPU with 4 GiB of memory and 640 cores. The plan was that this would enable me to realize my long-held plans of diving into GPU programming - specifically using compute shaders to implement mathematical simulation type stuff. True to my previously described inclination to procrastinate on interesting projects by delving into other interesting topics, my first step towards realizing this plan was of course acquainting myself with a new Linux distribution: NixOS.

After weeks of configuring I am now in the position of working inside a fully reproducible environment declaratively described by a set of version controlled text files^{1}. The main benefit of this is that my project-specific development environments are now easily portable and consistent across all my machines: spending the morning working on something on the workstation and continuing said work on the laptop between lectures in the afternoon is as easy as syncing the Nix environments. This is in turn easily achieved by including the corresponding `shell.nix` files in the project's repository.

Consider for example the environment I use to generate this very website, declaratively described in the Nix language:

```nix
with import <nixpkgs> {};

stdenv.mkDerivation rec {
  name = "blog-env";
  env  = buildEnv {
    name  = name;
    paths = buildInputs;
  };

  buildInputs = let
    generate = pkgs.callPackage ./pkgs/generate.nix {};
    preview  = pkgs.callPackage ./pkgs/preview.nix {};
    katex    = pkgs.callPackage ./pkgs/KaTeX.nix {};
  in [
    generate
    preview
    pandoc
    highlight
    katex
  ];
}
```

Using this `shell.nix` file the blog can be generated by my mostly custom XSLT-based setup^{2} by issuing a simple `nix-shell --command "generate"` in the repository root. All dependencies - be it pandoc for markup transformation, a custom KaTeX wrapper for server-side math expression typesetting or my very own InputXSLT - will be fetched and compiled as necessary by Nix.

```nix
{ stdenv, fetchFromGitHub, cmake, boost, xalanc, xercesc, discount }:

stdenv.mkDerivation rec {
  name = "InputXSLT";

  src = fetchFromGitHub {
    owner  = "KnairdA";
    repo   = "InputXSLT";
    rev    = "master";
    sha256 = "1j9fld3sh1jyscnsx6ab9jn5x6q67rjh9p3bgsh5na1qrs40dql0";
  };

  buildInputs = [ cmake boost xalanc xercesc discount ];

  meta = with stdenv.lib; {
    description = "InputXSLT";
    homepage    = https://github.com/KnairdA/InputXSLT/;
    license     = stdenv.lib.licenses.asl20;
  };
}
```

This will work on any system where the Nix package manager is installed, without any further manual intervention by the user. So where in the past I had to manually ensure that all dependencies were available - which included compiling and installing my custom site generator stack - I can now simply clone the repository and generate the website with a single command^{3}.

It cannot be overstated how powerful the system management paradigm implemented by Nix and NixOS is. On NixOS I am finally free to try out anything I desire without fear of polluting my system and creating an unmaintainable mess, as everything can be isolated and garbage collected when I don't need it anymore. Sure, it is some additional effort to maintain Nix environments and write a custom derivation here and there for software that is not yet available^{4} in nixpkgs, but when your program works or your project compiles you can be sure that it does so because the system is configured correctly and all dependencies are accounted for - nothing works by accident^{5}.

Note that the `nix-shell`-based example presented above is only a small subset of what NixOS offers. Besides shell environments the whole system configuration - systemd services, the networking setup, my user GUI environment and so on - is also expressed in the Nix language. i.e. the whole system from top to bottom is declaratively described in a consistent fashion.

NixOS is the first distribution I am truly excited about since my initial stint of distro-hopping when I first got into Linux a decade ago. Its declarative package manager and configuration model is true innovation and one of those rare things where you already know that you will never go back to the old way of doing things after barely catching a glimpse of it. Sure, other distros can be nice and I greatly enjoyed my nights of compiling Gentoo as well as years spent tinkering with my ArchLinux systems, but NixOS offers something truly distinct and incredibly useful. At first I thought about using the Nix and Scheme based GuixSD distribution instead, but I got used to the Nix language quickly and do not think that the switch to Guile Scheme as the configuration language adds enough to offset having to deal with GNU's free software fundamentalism^{6}.

Of course I was not satisfied with merely porting my workflows onto a new, superior distribution but also had to switch from i3 to XMonad in the same breath. By streamlining my tiling window setup on top of this Haskell-based window manager my setup has reached a new level of minimalism. Layouts are now restricted to either fullscreen, tabbed or simple side-by-side tiling and everything is controlled using Rofi instances and keybindings. My constant need to check battery level, fan speed and system performance was fixed by removing all bars and showing only minimally styled windows. And due to the reproducibility^{7} of NixOS the interested reader can check out the full system themselves if they so desire! :-) See the home-manager based user environment or specifically the XMonad config for further details.

After getting settled in this new working environment I finally was out of distractions and moved on to my original wish of familiarizing myself with delegating non-graphical work to the GPU. The first presentable result of this undertaking is my minimalistic fieldplay clone computicle.

What computicle does is simulate many particles moving according to a vector field described by a function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that is interpreted as an ordinary differential equation to be solved using the classical Runge-Kutta method. As this problem translates into many similar calculations performed per particle without any communication between particles, it is an ideal candidate for massive parallelization using GLSL compute shaders on the GPU.

```glsl
#version 430

layout (local_size_x = 1) in;
layout (std430, binding=1) buffer bufferA { float data[]; };

vec2 f(vec2 v) {
  return vec2(
    cos(v.x*sin(v.y)),
    sin(v.x-v.y)
  );
}

vec2 classicalRungeKutta(float h, vec2 v) {
  const vec2 k1 = f(v);
  const vec2 k2 = f(v + h/2. * k1);
  const vec2 k3 = f(v + h/2. * k2);
  const vec2 k4 = f(v + h    * k3);

  return v + h * (1./6.*k1 + 1./3.*k2 + 1./3.*k3 + 1./6.*k4);
}

[...]

void main() {
  const uint i = 3*gl_GlobalInvocationID.x;

  const vec2 v = vec2(data[i+0], data[i+1]);
  const vec2 w = classicalRungeKutta(0.01, v);

  data[i+0] = w.x;   // particle x position
  data[i+1] = w.y;   // particle y position
  data[i+2] += 0.01; // particle age
}
```

As illustrated by this simplified extract of computicle's compute shader, writing code for the GPU can look and feel quite similar to targeting the CPU in the C language. Fittingly, my main gripes during development were not with the GPU code itself but rather with the surrounding C++ code required to pass the data back and forth and talk to the OpenGL state machine in a sensible manner.
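To make that similarity concrete, the same classical Runge-Kutta step translates almost line for line into C++. The following is my own illustration, not code from computicle; the tiny `vec2` struct merely stands in for GLSL's built-in vector type:

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for GLSL's vec2 so the kernel reads the same on the CPU.
struct vec2 {
  float x, y;
  vec2 operator+(vec2 o) const { return { x + o.x, y + o.y }; }
  vec2 operator*(float s) const { return { x * s, y * s }; }
};
vec2 operator*(float s, vec2 v) { return v * s; }

// The same example vector field as in the compute shader.
vec2 f(vec2 v) {
  return { std::cos(v.x * std::sin(v.y)), std::sin(v.x - v.y) };
}

// Classical fourth-order Runge-Kutta step, mirroring the GLSL version.
vec2 classicalRungeKutta(float h, vec2 v) {
  const vec2 k1 = f(v);
  const vec2 k2 = f(v + h/2.f * k1);
  const vec2 k3 = f(v + h/2.f * k2);
  const vec2 k4 = f(v + h     * k3);

  return v + h * (1.f/6.f*k1 + 1.f/3.f*k2 + 1.f/3.f*k3 + 1.f/6.f*k4);
}
```

The only real differences are the float literal suffixes and the hand-rolled operators - the numerical logic is untouched, which is exactly why porting such kernels between CPU and GPU is so pleasant.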

The first issue was how to include GLSL shader source in my C++ application. While the way OpenGL accepts shaders as raw strings and compiles them for the GPU on the fly is not without benefits (e.g. switching between shaders generated at runtime is trivial), it can quickly turn ugly and doesn't feel well integrated into the overall language. Reading shader source from text files at runtime was not the way I wanted to go either, as this would feel even more clunky and fragile. What I settled on - until the committee comes through with something like `std::embed` - is to include the shader source as multi-line string literals stored in static constants placed in separate headers. This *works* for now and at least offers syntax highlighting in terms of editor support.
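A minimal sketch of this approach - the namespace, constant name and GLSL snippet are illustrative placeholders, not computicle's actual shader headers:

```cpp
#include <cassert>
#include <string>

// Sketch: the shader source lives in a static constant (in the real
// project placed in a separate header) as a C++11 raw string literal,
// which keeps the GLSL readable and lets editors highlight it. Nothing
// is compiled for the GPU here - the string is only handed to
// glShaderSource at runtime.
namespace shader {

const std::string vertex = R"(
#version 430
uniform uint nX;
uniform uint nY;
void main() {
  // [...] actual vertex logic lives here
}
)";

}
```

At runtime one would pass `shader::vertex.c_str()` to the usual `glShaderSource` / `glCompileShader` calls.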

What would be really nice is if the shaders could be generated from a domain-specific language and statically verified at compile time. Such a solution could also offer unified tools for handling uniform variables and data buffer bindings. While something like that doesn't seem to be available for C++^{8}, I stumbled upon the very interesting LambdaCube 3D and varjo projects. The former promises to become a Haskell-like purely functional language for GPU programming and the latter is an interesting GLSL-generating framework for Lisp.

I also could not find a nice and reasonably lightweight library for interfacing with the OpenGL API in a modern fashion, so I ended up creating my own scope-guard type wrappers around the OpenGL functionality required by computicle. While the result looks nice, it is probably of limited portability to other applications.
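The core idea behind such wrappers can be sketched generically: `use()` performs the binding and returns a guard object whose destructor undoes it when the scope ends. In this self-contained illustration the actual GL calls are replaced by a logging mock - the real wrappers would call e.g. `glUseProgram` or `glBindFramebuffer` instead; all names here are my own, not computicle's API:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Generic RAII scope guard: runs the given callable on destruction.
class ScopeGuard {
public:
  explicit ScopeGuard(std::function<void()> onExit):
    _onExit(std::move(onExit)) { }
  ScopeGuard(ScopeGuard&& other):
    _onExit(std::move(other._onExit)) {
    other._onExit = [](){}; // moved-from guard becomes a no-op
  }
  ~ScopeGuard() { _onExit(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
  std::function<void()> _onExit;
};

// Mock shader that logs bind/unbind instead of talking to OpenGL.
struct MockShader {
  std::vector<std::string>& log;
  ScopeGuard use() {
    log.push_back("bind");
    return ScopeGuard([this]() { log.push_back("unbind"); });
  }
};
```

Used as `auto guard = shader.use();`, the binding is guaranteed to be released at the end of the enclosing block - exactly the pattern visible in the update loop below.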

```cpp
// simplified extract of computicle's update loop
window.render([&]() {
  [...]
  if ( timer::millisecondsSince(last_frame) >= 1000/max_ups ) {
    auto guard = compute_shader->use();

    compute_shader->setUniform("world", world_width, world_height);
    compute_shader->dispatch(particle_count);

    last_frame = timer::now();
  }
  [...]
  {
    auto texGuard = texture_framebuffers[0]->use();
    auto sdrGuard = scene_shader->use();

    scene_shader->setUniform("MVP", MVP);
    [...]
    particle_buffer->draw();
  }
  {
    auto guard = display_shader->use();

    display_shader->setUniform("screen_textures",      textures);
    display_shader->setUniform("screen_textures_size", textures.size());

    glClear(GL_COLOR_BUFFER_BIT);
    display_buffer->draw(textures);
  }
});
```

One idea that I am currently toying with in respect to my future GPU-based projects is to abandon C++ as the host language and instead use a more flexible^{9} language such as Scheme or Haskell for generating the shader code and communicating with the GPU. This could work out well as the performance of CPU code doesn’t matter as much when the bulk of the work is performed by shaders. At least this is the impression I got from my field visualization experiment - the CPU load was minimal independent of how many kiloparticles were simulated.

See nixos_system and nixos_home↩︎

See the summary node or Expanding XSLT using Xalan and C++↩︎

And this works on all my systems, including my Surface 4 tablet where I installed Nix on top of Debian running in WSL↩︎

Which is not a big problem in practice as the repository already provides a vast set of software and builders for many common build systems and adapters for language specific package managers. For example my Vim configuration including plugin management is also handled by Nix. The clunky custom texlive installation I maintained on my ArchLinux system was replaced by nice, self-contained shell environments that only provide the $\LaTeX$ packages that are actually needed for the document at hand.↩︎

At least if you are careful about what is installed imperatively using `nix-env` or if you use the `--pure` flag in `nix-shell`↩︎

Which I admire greatly - but I also want to use the full power of my GPU and run proprietary software when necessary↩︎

And the system really is fully reproducible: I now tested this two times, once when moving the experimental setup onto a new SSD and once when installing the workstation config on my laptop. Each time I was up and running with the full configuration as I left it in under half an hour. Where before NixOS a full system failure would have incurred days of restoring backups, reconstructing my specific configuration and reinstalling software I can now be confident that I can be up and running on a replacement machine simply by cloning a couple of repositories and restoring a home directory backup.↩︎

At least when one wants to work with compute shaders - I am sure there are solutions in this direction for handling graphic shaders for gaming and CAD type stuff.↩︎

Flexible as in better support for domain-specific languages↩︎