Suppose you have some kind of list of types. Such a list can by itself be used to perform any compile-time computation one might come up with. So let us suppose that you additionally want to construct a tuple from something that is based on this list, i.e. you want to connect the compile-time-only type list to a run-time object. In such a case you might run into new questions such as: How do I call constructors for each of my tuple values? How do I offer access to the tuple values using only the type as a reference? How do I call a function for each value in the tuple while preserving the connection to the compile-time list? If such questions are of interest to you, this article might be as well.

While the standard’s tuple template is part of the C++ subset I use in basically all of my developments^{1}, I recently had to revisit some of these questions while reworking OpenLB’s core data structure using its *meta descriptor* concept. The starting point for this was a class template called `FieldArrayD` that stores an array of instances of a single field in a SIMD-vectorization-friendly *structure of arrays* layout. As an LBM lattice in practice stores not just one such field type but multiple of them (all declared in the central *descriptor* structure), I then wanted a `MultiFieldArrayD` class template that does just that: a simple wrapper that accepts a list of fields as a variadic template parameter pack and instantiates a `FieldArrayD` for each of them. A sensible place for storing these instances is of course our trusty `std::tuple`:

```cpp
/// SoA storage for instances of a single FIELD
template<typename T, typename DESCRIPTOR, typename FIELD>
struct FieldArrayD : public ColumnVector<T,DESCRIPTOR::template size<FIELD>()> {
  FieldArrayD(std::size_t count):
    ColumnVector<T,DESCRIPTOR::template size<FIELD>()>(count) { }
  /* [...] */
};

template<typename T, typename DESCRIPTOR, typename... FIELDS>
class MultiFieldArrayD {
private:
  std::tuple<FieldArrayD<T,DESCRIPTOR,FIELDS>...> _data;
  /* [...] */
```

A constructor for such a `MultiFieldArrayD` class should now pass the same element count to each element constructor of the `_data` tuple. This is more difficult than forwarding an individual value to each element, which could be done using a common perfect forwarding pattern. But after some playing around I came up with a constructor

```cpp
MultiFieldArrayD(std::size_t count):
  _count(count),
  // Trickery to construct each member of _data with `count`.
  // Uses the comma operator in conjunction with type dropping.
  _data((utilities::meta::void_t<FIELDS>(), count)...)
{ }
```

that does what I want in a much more compact fashion than I expected at the beginning. Let’s unwrap this: `utilities::meta::void_t` is a placeholder implementation of C++17’s `std::void_t` that I use until we upgrade our C++14 code base^{2} to something more recent. In this case this somewhat aids the exposition, as we can easily take a look at its definition:

```cpp
template <typename...>
using void_t = void;
```

If we consider this template to be a function, it simply swallows any arguments it is given and returns `void`. What we want to achieve is to duplicate the `count` parameter `sizeof...(FIELDS)` times and pass this parameter pack to the tuple’s perfect forwarding constructor. Such a pack is easily generated using the variadic expansion operator `...`. Sadly, for this to work we need some kind of type-level dependency on the types in our pack, which we do not really have when duplicating the count value (ignoring the number of times we want to duplicate it). One crafty way of getting a dependency anyway is to use the not very well known comma operator.

The comma operator forms a binary expression `a, b` that evaluates both `a` and `b` but returns only `b`. That is, the expression `(void_t<FIELDS>(), count)` depends on the types in the list `FIELDS` but swallows them without using them, in favour of returning `count`. All in all this means that `(void_t<FIELDS>(), count)...` will evaluate to a list of `sizeof...(FIELDS)` copies of `count` that are then passed as arguments to the tuple constructor. Note that if the field types are constructible we could also write e.g. `(FIELDS(), count)...`, but this doesn’t work for my use case as I do not want my description-only field types to be runtime-instantiable.

The next thing we might want to do after successfully constructing a `MultiFieldArrayD` is to access an individual `FieldArrayD` instance. If we know the index of the desired field in the variadic list this is easily done using a plain call to `std::get`. In practice I find that `fields.get<FORCE>()` both looks nicer than e.g. `fields.get<1>()` and is also self-documenting, which is always desirable. To do this we use the implicit assumption that types are not duplicated in our list and provide a recursive constexpr function to calculate the index:

```cpp
template <
  typename WANTED_FIELD,
  typename CURRENT_FIELD,
  typename... FIELDS,
  // WANTED_FIELD equals the head of our field list, terminate recursion
  std::enable_if_t<std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexInFieldList()
{
  return 0;
}

template <
  typename WANTED_FIELD,
  typename CURRENT_FIELD,
  typename... FIELDS,
  // WANTED_FIELD doesn't equal the head of our field list
  std::enable_if_t<!std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexInFieldList()
{
  // Break compilation when WANTED_FIELD is not provided by list of fields
  static_assert(sizeof...(FIELDS) > 0, "Field not found.");
  return 1 + getIndexInFieldList<WANTED_FIELD,FIELDS...>();
}
```

This could probably be written more compactly using e.g. a `std::conditional_t` alias template, but this way we get a sensible assertion error when the field is not available. Furthermore, as this function is also required in other areas of the field concept^{3}, the actual call in `MultiFieldArrayD` reads rather well:

```cpp
template <typename FIELD>
FieldArrayD<T,DESCRIPTOR,FIELD>& get()
{
  return std::get<descriptors::getIndexInFieldList<FIELD,FIELDS...>()>(_data);
}
```

The concept of swallowing during variadic pack expansion can also be utilized to call a lambda expression for each value of the tuple. This is useful as a building block for writing e.g. initialization or data serialization code that commonly needs to iterate over all fields. For example, consider an extract of a copy assignment operator for a facade class representing a single cell of a lattice:

```cpp
template <typename T, typename DESCRIPTOR>
Cell<T,DESCRIPTOR>& Cell<T,DESCRIPTOR>::operator=(ConstCell<T,DESCRIPTOR>& rhs)
{
  /* [...] */
  this->_staticFieldsD.forFieldsAt(this->_iCell, [&rhs](auto field, auto id) {
    field = rhs.getFieldPointer(id);
  });
  /* [...] */
```

Or a code snippet to serialize all field data to a sequential buffer:

```cpp
T* currData = data + DESCRIPTOR::template size<descriptors::POPULATION>();
this->_staticFieldsD.forFieldsAt(this->_iCell, [&currData](auto field, auto id) {
  for (unsigned iDim=0; iDim < decltype(field)::d; ++iDim) {
    *(currData++) = field[iDim];
  }
});
```

The common element of these examples is of course the call to `forFieldsAt`, which is a template method of `MultiFieldArrayD`. As its structure suggests, the generic lambda expression is called for each field instance that belongs to the index `_iCell`. The `field` argument is an instance of some structure that provides access to the correct row of the `FieldArrayD` instance belonging to the current field, and `id` is an identifier that can be used to connect this back to the actual field type (as the `field` argument is a generic vector type that only carries the size of the row and not the field name).

```cpp
template <typename F>
void forFieldsAt(std::size_t idx, F f)
{
  utilities::meta::swallow(
    (f(get<FIELDS>().getFieldPointer(idx), utilities::meta::id<FIELDS>{}), 0)...
  );
}
```

As we can see, the expectations towards such a `forFieldsAt` function are surprisingly easy to fulfill by using the *swallow pattern*. The `utilities::meta::swallow` function is needed here as variadic pack expansion in some sense needs a place to expand into. In our previous example this was the tuple constructor, but as we do not need to construct anything here, `swallow` fills the same niche.

```cpp
/// Function equivalent of void_t, swallows any argument
template <typename... ARGS>
void swallow(ARGS&&...) { }
```

A closer look at the expanded comma operator expression shows that the function argument `f` is passed two arguments and the void result is dropped in favour of returning and subsequently swallowing zero. The first argument is the reference to the requested row of our SoA storage and the second argument is a helper class to work around the non-constructibility of the field type in this specific situation. Note that invoking `f` with different argument types for each field works thanks to C++14’s generic lambda expressions: any `auto` arguments are templatized in the generated function call operator of the lambda’s closure class.

```cpp
template <typename TYPE>
struct id {
  using type = TYPE;
};
```

Using this identity wrapper struct enables us to employ C++’s template argument deduction rules to access the field type without knowing the corresponding template parameter name in our generic lambda.

```cpp
template <typename T, typename DESCRIPTOR>
template <typename FIELD_ID>
VectorPtr<T,DESCRIPTOR::template size<typename FIELD_ID::type>()>
Cell<T,DESCRIPTOR>::getFieldPointer(FIELD_ID id)
{
  return getFieldPointer<typename FIELD_ID::type>();
}
```

In theory, both field type and field value access could be combined in a single argument of the generic lambda expression passed to `forFieldsAt`, but this would require field-specific `VectorPtr` instantiations in my specific situation.

All in all, this article illustrates another step I took in my quest to generate efficient data structures for population and field data from a single high-level type description while preserving self-documentation and static handling of the memory layout, without any need for the user to juggle raw offsets. The specific *swallow pattern* used in this instance is something I feel will come in handy in even more situations in the future. It really is much more compact and readable than any equivalent implementation using e.g. index sequences would be.

1. Also not the first time on this blog, e.g. *mapping arrays using tuples* in 2014 or *mapping binary structures as tuples* in 2013.↩︎
2. Not done yet as we need to support various older compilers and HPC environments, e.g. Intel’s compiler tends to be problematic in this context but yields significant performance gains for large simulations.↩︎
3. See *expressive meta templates for flexible handling of compile-time constants* for further examples.↩︎

To both not leave the 2010s behind with just one measly article in their last year and to showcase some of the stuff I am currently working on this article covers a bouquet of topics – spanning both math-heavy theory and practical software development as well as travels to new continents. As to retroactively befit the title this past year of mine was dominated by various topics in the field of Lattice Boltzmann Methods. CFD in general and LBM in particular have shaped to become the common denominator of my studies, my work and even my leisure time.

The year began with the successful conclusion of my undergraduate studies of Mathematics at KIT. My corresponding Bachelor thesis discusses *Grid refined Lattice Boltzmann Methods in OpenLB*, in particular the approach taken by Lagrava et al. in *Advances in Multi-domain Lattice Boltzmann Grid Refinement*. The goal of such developments is to port one of the advantages of more classical approaches to fluid dynamics, namely Finite Element or Finite Volume methods, into the world of LBM: the ability to straightforwardly fit the discretizing mesh to the problem at hand. This feature is intrinsic to FEM as all computations are mapped from a physically embedded mesh of e.g. triangles into reference elements. The embedded mesh may be *easily* adapted to e.g. be more fine-grained at boundaries or in other areas where the modeled fluid structures are more involved.

Doing this for the regular grids employed by LB implementations is more difficult in the sense that there is no intrinsic way to convert between differently resolved grids. What is more, it is not desirable to remove too much of the lattice structure’s regularity, as this regularity is one of the main aspects supporting the performance advantage, which in turn is one of the method’s main selling points. On the theoretical side the main question is how to convert the population values at the contact surface between two differently resolved grids. Coming from a high-resolution grid one has to decide how to restrict the more detailed information to a lower resolution, and coming from a low-resolution grid one has to find a way to recover the missing information compared to the targeted higher resolution. These questions are reflected directly in Lagrava’s approach by distinguishing between a restriction and an interpolation of the populations’ non-equilibrium part.

The practical impact of my work during this thesis on OpenLB is a prototype implementation of grid refinement in 2D. In due time this will be expanded into a universally usable implementation for both two and three spatial dimensions but adding support for GPU-based computations to OpenLB currently enjoys a higher priority – but more on that later.

As one of the seminars required for my Master degree I studied how symbolic optimization, specifically common subexpression elimination, can help to automatically generate high-performing LB implementations. To fit the overarching goal of my work, the chosen target architecture for this was GPGPUs such as Nvidia’s P100.

As is detailed in the corresponding report I was pleasantly surprised by the performance resulting from code generated by formulating the LB collision step in the SymPy CAS library and applying the offered CSE optimization.

| CSE | D2Q9 (single) | D2Q9 (double) | D3Q19 (single) | D3Q19 (double) | D3Q27 (single) | D3Q27 (double) |
|-----|---------------|---------------|----------------|----------------|----------------|----------------|
| No  | 96.1% | 75.7% | 73.2% | 55.9% | 63.0% | 51.3% |
| Yes | 95.6% | 96.4% | 96.9% | 98.7% | 94.9% | 99.8% |

Just as an example the table above lists the achieved performance on a P100 compared to the theoretical maximum on this platform before and after eliminating common subexpressions. The newer the hardware I test this on is, the less hand-optimization of the kernel code seems to matter. This nicely mirrors the historic development of CPUs where the hardware got better and better at efficiently executing code that is not optimized for a specific target CPU.

One of my current main interests is to expand on these results to develop a general framework for automatic Lattice Boltzmann kernel generation. The boltzgen library marks my first steps in this direction and is also my first serious use case for the Python ecosystem. Whereas I was originally not very fond of Python as a language – the switch from Python 2 to 3 and the surrounding issues as well as the syntax shaped my opinions there – the development speed and ease of expression kind of won me over during the course of this year. If one is mainly plugging together existing frameworks and delegating work to the GPU the resulting code tends to be more pleasant than a comparable development in e.g. C++.

Most of my working hours as a student employee of KIT’s Lattice Boltzmann Research Group were spent on two far reaching new developments: Implementing a template based framework for managing the memory of the various data fields required for LBM simulations and rewriting the essential Cell data structure into a pure data view. Details of the former are available in my article on *Expressive meta templates for flexible handling of compile-time constants*. The latter project lays the groundwork for my implementation of the *Shift-Swap-Streaming* propagation pattern that will be included in the next OpenLB release. This switch from the old collision-centric propagation pattern detailed by Mattila et al. in *An Efficient Swap Algorithm for the Lattice Boltzmann Method* to a new GPU- and vectorization-friendly algorithm is an important milestone in our ongoing quest to implement GPU-support in OpenLB. SSS is a very nice reformulation of the established single-grid A-A pattern into a plain collision step followed by changes to memory pointers in a central control structure. This means that streaming of information between neighboring lattice cells is not performed by explicitly moving memory around but rather by cunningly swapping and shifting some pointers. As an illustration:

Further details of this approach developed by Mohrhard et al. – in the same research group that I am currently working in – are available in *An Auto-Vectorization Friendly Parallel Lattice Boltzmann Streaming Scheme for Direct Addressing*.

At the time that I am writing this article I’ve only been back in Germany for about two weeks as I had the great opportunity to spend three weeks in Brazil at the University of Rio Grande do Sul. There I amongst other things held a talk on the *Efficient parallel implementation* of Lattice Boltzmann Methods – of which the slides in the previous section are an extract – as part of a workshop jointly organized by LBRG and SBCB.

I very much enjoyed my time in Porto Alegre and had the chance to discover Brazil as a country that I’d really like to spend more time travelling in – just look at some of the views we had during a weekend trip to Torres…

…and the Itaimbezinho canyon near Cambara do Sul:

After I ended up with a quite well performing GPU LBM code as a result of my seminar talk on symbolic code optimization, I chose to expend some effort on developing nice-looking real-time visualizations. Some of them are collected on my YouTube channel as well as linked behind the images in this section.

The quest to visualize three dimensional fluid flow led me into the field of computer graphics, specifically ray marching and signed distance functions. The former is useful when one considers the velocity field resulting from a simulation as a participating media through which light is shining while the latter may be used for describing, displaying and even voxelizing obstacle geometries.

For now the sources for these and other simulations still reside in a playground repository, but one of my goals for the upcoming year is to further develop my own LB code based on the framework described in a previous section of this article. In addition, I also prototyped SDF-based indicator functions for OpenLB during my stay in Brazil and some form of support for this will be included in the upcoming release. Constructive solid geometry based on such functions offers a very flexible and information-rich concept for constructing simulation models, e.g. outer normals for certain boundary conditions are easily extracted from such a description.

As an example consider the full code of the grid fin geometry visualized above:

```glsl
float sdf(vec3 v) {
  v = rotate_z(translate(v, v3(center.x/2, center.y, center.z)), -0.6);
  const float width = 1;
  const float angle = 0.64;
  return add(
    sadd(
      sub(
        rounded(box(v, v3(5, 28, 38)), 1),
        rounded(box(v, v3(6, 26, 36)), 1)
      ),
      cylinder(translate(v, v3(0,0,-45)), 5, 12),
      1
    ),
    sintersect(
      box(v, v3(5, 28, 38)),
      add(
        add(
          box(rotate_x(v, angle), v3(10, width, 100)),
          box(rotate_x(v, -angle), v3(10, width, 100))
        ),
        add(
          add(
            add(
              box(rotate_x(translate(v, v3(0,0,25)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,25)), -angle), v3(10, width, 100))
            ),
            add(
              box(rotate_x(translate(v, v3(0,0,-25)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,-25)), -angle), v3(10, width, 100))
            )
          ),
          add(
            add(
              box(rotate_x(translate(v, v3(0,0,50)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,50)), -angle), v3(10, width, 100))
            ),
            add(
              box(rotate_x(translate(v, v3(0,0,-50)), angle), v3(10, width, 100)),
              box(rotate_x(translate(v, v3(0,0,-50)), -angle), v3(10, width, 100))
            )
          )
        )
      ),
      2
    )
  );
}
```

This quickly thrown together prototype is already somewhat reminiscent of how geometries are described by CSG-based CAD software packages such as OpenSCAD. As I have just started out working on this, I expect lots of further fun with it – and everything else detailed in this article – for the upcoming year.

So we recently released a new version of OpenLB which includes a major refactoring of the central data structure used to handle the various kinds of compile-time constants required by a simulation. This article summarizes the motivation and design of this new concept as well as highlighting a couple of tricks and pitfalls in the context of template metaprogramming.

Every simulation based on Lattice Boltzmann Methods can be characterized by a set of constants such as the modelled spatial dimension, the number of neighbors in the underlying regular grid, the weights used to compute equilibrium distributions or the lattice speed of sound. Due to OpenLB’s goal of offering a wide variety of LB models to address many different kinds of flow problems, the constants are not hardcoded throughout the codebase but rather maintained in compile-time data structures. Any usage of these constants can then refer to the characterizing descriptor data structure.

```cpp
/// Old equilibrium implementation using descriptor data
static T equilibrium(int iPop, T rho, const T u[DESCRIPTOR::d], const T uSqr)
{
  T c_u = T();
  for (int iD=0; iD < DESCRIPTOR::d; ++iD) {
    c_u += DESCRIPTOR::c[iPop][iD]*u[iD];
  }
  return rho * DESCRIPTOR::t[iPop] * (
    (T)1 + DESCRIPTOR::invCs2 * c_u
         + DESCRIPTOR::invCs2 * DESCRIPTOR::invCs2 * (T)0.5 * c_u * c_u
         - DESCRIPTOR::invCs2 * (T)0.5 * uSqr
  ) - DESCRIPTOR::t[iPop];
}
```

As many parts of the code do not actually care which specific descriptor is used, most classes and functions are templates that accept any user-defined descriptor type. This allows us to e.g. select descriptor specific optimizations^{1} via plain template specializations.

To continue, the descriptor concept is tightly coupled to the definition of the cells that make up the simulation lattice. The reason for this connection is that we require some place to store the essential per-direction population fields for each node of the lattice. In OpenLB this place is currently the `Cell` class^{2}, which locally maintains the population data and as such implements a collision-optimized *array of structures* memory layout. As a side note, this was the initial motivation for rethinking the descriptor concept, as we require more flexible structures to turn this into a more efficient *structure of arrays* situation^{3}.

To better appreciate the new concept we should probably first take a closer look at how this was implemented previously. As a starting point, all descriptors were derived from a descriptor base type such as `D2Q9DescriptorBase` for two-dimensional lattices with nine discrete velocities:

```cpp
template <typename T>
struct D2Q9DescriptorBase {
  typedef D2Q9DescriptorBase<T> BaseDescriptor;
  enum { d = 2, q = 9 };         ///< number of dimensions/distr. functions
  static const int vicinity;     ///< size of neighborhood
  static const int c[q][d];      ///< lattice directions
  static const int opposite[q];  ///< opposite entry
  static const T t[q];           ///< lattice weights
  static const T invCs2;         ///< inverse square of speed of sound
};
```

As we can see this is a plain struct template with some static member constants to store the data. This in itself is not problematic and worked just fine since the project’s inception. Note that the template allows for specification of the floating point type used for all non-integer data. This is required to e.g. use automatic differentiation types that allow for taking the derivative of the whole simulation in order to apply optimization techniques.

```cpp
template<typename T>
const int D2Q9DescriptorBase<T>::vicinity = 1;

template<typename T>
const int D2Q9DescriptorBase<T>::c
  [D2Q9DescriptorBase<T>::q][D2Q9DescriptorBase<T>::d] = {
  { 0, 0},
  {-1, 1}, {-1, 0}, {-1,-1}, { 0,-1},
  { 1,-1}, { 1, 0}, { 1, 1}, { 0, 1}
};

template<typename T>
const int D2Q9DescriptorBase<T>::opposite[D2Q9DescriptorBase<T>::q] = {
  0, 5, 6, 7, 8, 1, 2, 3, 4
};

template<typename T>
const T D2Q9DescriptorBase<T>::t[D2Q9DescriptorBase<T>::q] = {
  (T)4/(T)9,
  (T)1/(T)36, (T)1/(T)9, (T)1/(T)36, (T)1/(T)9,
  (T)1/(T)36, (T)1/(T)9, (T)1/(T)36, (T)1/(T)9
};

template<typename T>
const T D2Q9DescriptorBase<T>::invCs2 = (T)3;
```

The actual data was stored in a separate header `src/dynamics/latticeDescriptors.hh`. All in all this very straightforward approach worked as expected and, as far as the descriptor concept is concerned, could be fully resolved at compile time to avoid unnecessary run-time jumps inside critical code sections. The real issue starts when we take a look at the so-called *external fields*:

```cpp
struct Force2dDescriptor {
  static const int numScalars    = 2;
  static const int numSpecies    = 1;
  static const int forceBeginsAt = 0;
  static const int sizeOfForce   = 2;
};

struct Force2dDescriptorBase {
  typedef Force2dDescriptor ExternalField;
};

template <typename T>
struct ForcedD2Q9Descriptor
  : public D2Q9DescriptorBase<T>,
    public Force2dDescriptorBase { };
```

Some LBM models require additional per-cell data such as external force vectors or values to model chemical properties. As we can see, the declaration of these *external fields* is another task of the descriptor data structure and *the* one that received the ugliest solution in our original implementation.

```cpp
// Set force vectors in all cells of material number 1
sLattice.defineExternalField(
  superGeometry, 1,
  DESCRIPTOR<T>::ExternalField::forceBeginsAt,
  DESCRIPTOR<T>::ExternalField::sizeOfForce,
  force
);
```

For example, this is a completely unsafe access to raw memory, as `forceBeginsAt` and `sizeOfForce` define arbitrary memory offsets. And while we might not care about security in this context, you can probably imagine the kinds of obscure bugs caused by potentially faulty and inconsistent handling of such offsets. To make things worse, the naming of external field indices and size constants was inconsistent between different fields, and things only worked as long as an unclear set of naming and layout conventions was followed.

If you want to risk an even closer look^{4} you can download version 1.2 or earlier and start your dive in `src/dynamics/latticeDescriptors.h`. Otherwise we are going to continue with a description of the new approach.

The initial spark for the development of the new meta descriptor concept was the idea to define external fields as the parametrization of a multilinear function in the foundational `D` and `Q` constants of each descriptor^{5}. Lists of such functions could then be passed around via variadic template argument lists. This allows for handling external fields in a manner that is both flexible and consistent across all descriptors.

Before we delve into the details of how these expectations were implemented, let us first take a look at how the basic `D2Q9` descriptor is defined in the latest OpenLB release:

```cpp
template <typename... FIELDS>
struct D2Q9 : public DESCRIPTOR_BASE<2,9,POPULATION,FIELDS...> {
  typedef D2Q9<FIELDS...> BaseDescriptor;
  D2Q9() = delete;
};

namespace data {

template <>
constexpr int vicinity<2,9> = 1;

template <>
constexpr int c<2,9>[9][2] = {
  { 0, 0},
  {-1, 1}, {-1, 0}, {-1,-1}, { 0,-1},
  { 1,-1}, { 1, 0}, { 1, 1}, { 0, 1}
};

template <>
constexpr int opposite<2,9>[9] = {
  0, 5, 6, 7, 8, 1, 2, 3, 4
};

template <>
constexpr Fraction t<2,9>[9] = {
  {4, 9},
  {1, 36}, {1, 9}, {1, 36}, {1, 9},
  {1, 36}, {1, 9}, {1, 36}, {1, 9}
};

template <>
constexpr Fraction cs2<2,9> = {1, 3};

}
```

These few compact lines^{6} describe the whole structure including all of its data. The various functions to access this data are auto-generated in a generic fashion using template metaprogramming and the previously verbose definition of a forced LB model reduces to a single self-explanatory type alias:

```cpp
using ForcedD2Q9Descriptor = D2Q9<FORCE>;
```

Descriptor data is now exposed via an adaptable set of free functions templated on the descriptor type. This was required to satisfy a secondary goal of decoupling descriptor data definitions and accesses in order to add support for both transparent auto-generation and platform adaptation (i.e. adding workarounds for porting the code to the GPU).

```cpp
/// Refactored generic equilibrium implementation
static T equilibrium(int iPop, T rho, const T u[DESCRIPTOR::d], const T uSqr)
{
  T c_u = T{};
  for (int iD = 0; iD < DESCRIPTOR::d; ++iD) {
    c_u += descriptors::c<DESCRIPTOR>(iPop,iD) * u[iD];
  }
  return rho * descriptors::t<T,DESCRIPTOR>(iPop) * (
    T{1} + descriptors::invCs2<T,DESCRIPTOR>() * c_u
         + descriptors::invCs2<T,DESCRIPTOR>() * descriptors::invCs2<T,DESCRIPTOR>()
           * T{0.5} * c_u * c_u
         - descriptors::invCs2<T,DESCRIPTOR>() * T{0.5} * uSqr
  ) - descriptors::t<T,DESCRIPTOR>(iPop);
}
```

The inclusion of the `descriptors` namespace slightly increases the verbosity of functions such as the one above. If things get too bad we can use local namespace inclusion as a workaround. But even if this were not possible, the transparent extensibility (i.e. the ability to customize the underlying implementation without changing all call sites) more than makes up for increasing the character count of some sections.

Back in 2013 I experimented with *mapping binary structures as tuples using template metaprogramming* in order to develop the foundations for a graph database. Surprisingly there were quite a few parallels between what I was doing then to what I am describing in this article. While I neither used the resulting BinaryMapping library for the development of GraphStorage nor ever used this then LevelDB-based graph *database* for more than a couple of basic examples, it was a welcome surprise to think back to my first steps doing more template-centered C++ programming.

```cpp
/// Base descriptor of a D-dimensional lattice with Q directions and a list of additional fields
template <unsigned D, unsigned Q, typename... FIELDS>
struct DESCRIPTOR_BASE {
  /// Deleted constructor to enforce pure usage as type and prevent implicit narrowing conversions
  DESCRIPTOR_BASE() = delete;

  /// Number of dimensions
  static constexpr int d = D;
  /// Number of velocities
  static constexpr int q = Q;

  /* [...] */
};
```

As the description of any LBM model includes at least a number of spatial dimensions `D` and a number of discrete velocities `Q`, these two constants are the required template arguments of the new `DESCRIPTOR_BASE` class template^{7}. Until we finally get concepts in C++, the members of the `FIELDS` list are by convention expected to offer `size` and `getLocalIndex` template methods accepting these two foundational constants.

```cpp
/// Base of a descriptor field whose size is defined by A*D + B*Q + C
template <unsigned C, unsigned A=0, unsigned B=0>
struct DESCRIPTOR_FIELD_BASE {
  /// Deleted constructor to enforce pure usage as type and prevent implicit narrowing conversions
  DESCRIPTOR_FIELD_BASE() = delete;

  /// Evaluates the size function
  template <unsigned D, unsigned Q>
  static constexpr unsigned size()
  {
    return A * D + B * Q + C;
  }

  /// Returns global index from local index and provides out_of_range safety
  template <unsigned D, unsigned Q>
  static constexpr unsigned getLocalIndex(const unsigned localIndex)
  {
    return localIndex < (A*D+B*Q+C)
      ? localIndex
      : throw std::out_of_range("Index exceeds data field");
  }
};
```

Most^{8} fields use the `DESCRIPTOR_FIELD_BASE` template as a base class. This template parametrizes the previously mentioned multilinear size function and allows for sharing field definitions between all descriptors.

```cpp
// Field types need to be distinct (i.e. not aliases) in order for `DESCRIPTOR_BASE::index` to work
// (Field size parametrized by: Cs + Ds*D + Qs*Q)           Cs Ds Qs
struct POPULATION : public DESCRIPTOR_FIELD_BASE<0, 0, 1> { };
struct FORCE      : public DESCRIPTOR_FIELD_BASE<0, 1, 0> { };
struct SOURCE     : public DESCRIPTOR_FIELD_BASE<1, 0, 0> { };
/* [...] */
```

Let us take the `FORCE` field as an example^{9}: This field represents a cell-local force vector and as such requires exactly `D` floating point values worth of storage. Correspondingly its base class is `DESCRIPTOR_FIELD_BASE<0,1,0>`, which yields a size of `2` for two-dimensional and `3` for three-dimensional descriptors.

Building upon this common field structure allows us to write down a `getIndexFromFieldList` helper function template that automatically calculates the starting offset of any element in an arbitrary list of fields:

```cpp
template <
  unsigned D, unsigned Q,
  typename WANTED_FIELD, typename CURRENT_FIELD, typename... FIELDS,
  // WANTED_FIELD equals the head of our field list, terminate recursion
  std::enable_if_t<std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexFromFieldList()
{
  return 0;
}

template <
  unsigned D, unsigned Q,
  typename WANTED_FIELD, typename CURRENT_FIELD, typename... FIELDS,
  // WANTED_FIELD doesn't equal the head of our field list
  std::enable_if_t<!std::is_same<WANTED_FIELD,CURRENT_FIELD>::value, int> = 0
>
constexpr unsigned getIndexFromFieldList()
{
  // Break compilation when WANTED_FIELD is not provided by list of fields
  static_assert(sizeof...(FIELDS) > 0, "Field not found.");

  // Add size of current field to implicit offset and continue search
  // for WANTED_FIELD in the tail of our field list
  return CURRENT_FIELD::template size<D,Q>()
       + getIndexFromFieldList<D,Q,WANTED_FIELD,FIELDS...>();
}
```

As far as template metaprogramming is concerned this code is quite basic – we simply traverse the variadic field list recursively and sum up the field sizes along the way. This function is wrapped by the `DESCRIPTOR_BASE::index` method template that exposes the memory offset of a given field. We are left with a generic interface that replaces our previous inconsistent and hard to maintain field offsets in the vein of `DESCRIPTOR::ExternalField::forceBeginsAt`.

```cpp
/// Returns index of WANTED_FIELD
/**
 * Fails compilation if WANTED_FIELD is not contained in FIELDS.
 * Branching that depends on this information can be realized using `provides`.
 **/
template <typename WANTED_FIELD>
static constexpr int index(const unsigned localIndex=0)
{
  return getIndexFromFieldList<D,Q,WANTED_FIELD,FIELDS...>()
       + WANTED_FIELD::template getLocalIndex<D,Q>(localIndex);
}
```

As we will see in the section on *improved field access* this method is not commonly used in user code but rather as a building block for self-documenting field accessors. One might notice that the abstraction layers are starting to pile up – luckily all of them are by themselves rather plain `constexpr` function templates and can as such still be fully collapsed at compile time.

The alert reader might have noticed that the type of the per-direction weight constants `descriptors::data::t` was changed to `Fraction` in our new meta descriptor. The reason for this is that we use variable templates to store these values and C++ sadly doesn’t allow partial specializations in this context. To elaborate, we are not allowed to write:

```cpp
template <typename T>
constexpr Fraction t<T,2,9>[9] = {
  T{4}/T{9},
  T{1}/T{36}, T{1}/T{9}, T{1}/T{36}, T{1}/T{9},
  T{1}/T{36}, T{1}/T{9}, T{1}/T{36}, T{1}/T{9}
};
```

To work around this issue I wrote a small floating-point independent fraction type:

```cpp
class Fraction {
private:
  const int _numerator;
  const int _denominator;

public:
  /* [...] */

  template <typename T>
  constexpr T as() const
  {
    return T(_numerator) / T(_denominator);
  }

  template <typename T>
  constexpr T inverseAs() const
  {
    return _numerator != 0 ? T(_denominator) / T(_numerator)
                           : throw std::invalid_argument("inverse of zero is undefined");
  }
};
```

This works well for both integral and automatically differentiable floating point types and even yields a more pleasant syntax for defining fractional descriptor values due to C++’s implicit constructor calls. One remaining hiccup is the representation of values such as square roots that are not easily expressed as readable rational numbers. Such weights are required by some more exotic LB models and are currently stored by explicit specialization for any required type. A slightly surprising fact in this context is that the C++ standard doesn’t require functions such as `std::sqrt` to be `constexpr`. This problem remained undetected for quite a while as e.g. GCC fixes this issue in a non-standard extension. So in the long term we are going to have to invest some more effort into adding compile-time math functions in the vein of GCEM.

As I hinted previously, one major change besides the refactoring of the actual descriptor structure was the introduction of an abstraction layer between data and call sites. i.e. where we previously wrote `DESCRIPTOR<T>::t[i]` to directly access the i-th weight we now call a free function `descriptors::t<T,DESCRIPTOR>(i)`. The advantage of this additional layer is the ability to transparently switch out the underlying data source. Furthermore we can easily expand such free functions to distinguish between various descriptor specializations at compile time via tagging.

```cpp
template <typename T, unsigned D, unsigned Q>
constexpr T t(unsigned iPop, tag::DEFAULT)
{
  return data::t<D,Q>[iPop].template as<T>();
}

template <typename T, typename DESCRIPTOR>
constexpr T t(unsigned iPop)
{
  return t<T, DESCRIPTOR::d, DESCRIPTOR::q>(iPop, typename DESCRIPTOR::category_tag());
}
```

This powerful concept uses C++’s function overload resolution to transparently call different implementations based on the given template arguments in a very compact fashion. As an example we can mark a descriptor using some non-default tag `tag::SPECIAL` and implement a function `T t(unsigned iPop, tag::SPECIAL)` to do some *special* stuff for this descriptor – the definition of both the tag and its function overload can be written anywhere in the codebase and will be automatically resolved by the generic implementation. This adds a whole new level of extensibility to OpenLB and is currently used to e.g. handle the special requirements of MRT LBM models.

One might have noticed that we accessed a `DESCRIPTOR::category_tag` typedef to select the correct function overload. While the canonical way to do function tagging is to simply define this type on a case by case basis in any tagged structure, I chose to develop something slightly more sophisticated: Tags are represented as special zero-size fields^{10} and passed to the descriptor specialization alongside any other fields. This feels quite nice and results in a very expressive and self-documenting interface for defining new descriptors.

```cpp
/// Base of a descriptor tag
struct DESCRIPTOR_TAG {
  template <unsigned, unsigned>
  static constexpr unsigned size()
  {
    return 0; // a tag doesn't have a size
  }
};
```

As such `DESCRIPTOR_BASE` is the only place where the `category_tag` type is defined. To do this we filter the given list of fields and select the first *tag-field* that is derived from our desired *tag-group* `tag::CATEGORY`.

```cpp
template <typename BASE, typename FALLBACK, typename... FIELDS>
using field_with_base = typename std::conditional<
  std::is_void<typename utilities::meta::list_item_with_base<BASE, FIELDS...>::type>::value,
  FALLBACK,
  typename utilities::meta::list_item_with_base<BASE, FIELDS...>::type
>::type;

/* [...] */

using category_tag = tag::field_with_base<tag::CATEGORY, tag::DEFAULT, FIELDS...>;
```

In order to implement the `utilities::meta::list_item_with_base` meta template I referred back to the *Scheme metaphor for template metaprogramming* which results in a readable filtering operation based on the tools offered by the standard library’s type traits:

```cpp
/// Get first type based on BASE contained in a given type list
/**
 * If no such list item exists, type is void.
 **/
template <
  typename BASE,
  typename HEAD = void, // Default argument in case the list is empty
  typename... TAIL
>
struct list_item_with_base {
  using type = typename std::conditional<
    std::is_base_of<BASE, HEAD>::value,
    HEAD,
    typename list_item_with_base<BASE, TAIL...>::type
  >::type;
};

template <typename BASE, typename HEAD>
struct list_item_with_base<BASE, HEAD> {
  using type = typename std::conditional<
    std::is_base_of<BASE, HEAD>::value,
    HEAD,
    void
  >::type;
};
```

The last remaining cornerstone of OpenLB’s new meta descriptor concept is the introduction of a set of convenient functions to access a cell’s field values via the field’s name. By taking this final step we get the ability to write simulation code that doesn’t handle any raw memory offsets in addition to being more compact. Furthermore we can now in theory completely modify the underlying field storage structures without forcing the user code to change.

```cpp
/// Return pointer to FIELD of cell
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<X::template provides<FIELD>(), T*>
getFieldPointer()
{
  const int offset = DESCRIPTOR::template index<FIELD>();
  return &(this->data[offset]);
}

template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<!X::template provides<FIELD>(), T*>
getFieldPointer()
{
  throw std::invalid_argument("DESCRIPTOR does not provide FIELD.");
  return nullptr;
}
```

The foundation of all field accessors is a new `Cell::getFieldPointer` method template that resolves the field location using the `DESCRIPTOR_BASE::index` and `DESCRIPTOR_BASE::size` functions we defined previously. Note that we had to loosen our newly gained compile-time guarantee of a field’s existence in favour of generating runtime exception code. The reason for this is that most current builds include code that depends on a certain set of fields even if those fields are not actually provided by a given descriptor. While we are going to resolve this unsatisfying situation in the future, this workaround offered an acceptable compromise.

```cpp
/// Set value of FIELD from a vector
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<(X::template size<FIELD>() > 1), void>
setField(const Vector<T,DESCRIPTOR::template size<FIELD>()>& field)
{
  std::copy_n(
    field.data,
    DESCRIPTOR::template size<FIELD>(),
    getFieldPointer<FIELD>());
}

/// Set value of FIELD from a scalar
template <typename FIELD, typename X = DESCRIPTOR>
std::enable_if_t<(X::template size<FIELD>() == 1), void>
setField(T value)
{
  getFieldPointer<FIELD>()[0] = value;
}
```

Note that disabling a member function specialization depending on its parent’s template arguments is only possible with some indirection: The parent template argument `DESCRIPTOR` is passed as the default value to the member function’s `X` argument. This parameter can then be used by `std::enable_if` as one would expect.

It is probably clear that the set of changes summarized so far marks a far-reaching revamp of the existing codebase – in fact there was scarcely a file untouched after I got everything to work again. As we do not live in an ideal world where I could have developed this in isolation while all other development was stopped, both the initial prototype and the following rollout to all of OpenLB had to be developed on a separate branch. Due to the additional hindrance that I am not actually working anywhere close to full-time on this^{11} these changes took quite a few months from inception to full realization. Correspondingly the meta descriptor and master branch had diverged significantly by the time we felt ready to merge – you can imagine how unpleasant it was to fiddle this back together.

I found the three-way merge functionality offered by Meld to be a most useful tool during this endeavour. My fingers were still twitching in a rhythmic pattern after two days of using this utility to more or less manually merge everything back together but it was still worlds better than the alternative of e.g. resolving the conflicts in a normal text editor.

Sadly even in retrospect I can not think of a better alternative to letting the branches diverge this far: A significant chunk of all lines had to be changed in randomly non-trivial ways and there was no discrete point in between where you could push these changes to the rest of the team with a good conscience. At least further changes to e.g. the foundational cell data structures should now prove to be significantly easier than they would have been without this refactor.

All in all I am quite satisfied with how this new concept turned out in practice: The code is smaller and more self-documenting while growing in extensibility and consistency. The internally increased complexity is restricted to a set of classes and meta templates that the ordinary user who just wants to write a simulation should never come in contact with. Some listings in this article might look cryptic at first but as far as template metaprogramming goes this is still reasonable – we did not run into any serious portability issues and everything works as expected in GCC, Clang and Intel’s C++ compiler^{12}.

To conclude things I want to encourage everyone to check out the latest OpenLB release to see these and other interesting new features in practice. Should this article have awakened any interest in CFD using Lattice Boltzmann Methods, a fun introduction is provided by my previous article on just this topic.

e.g. collision steps where all generic code is resolved using common subexpression elimination in order to minimize the number of floating point operations↩︎

see `src/core/cell.h` for further reading↩︎

The performance of LBM codes is in general not bound by the available processing power but rather by how well we utilize the available memory bandwidth. i.e. we want to optimize memory throughput as much as possible, which leads us to the need for more efficient streaming steps that in turn require changes to the memory layout.↩︎

Note that this examination of the issues with the previous descriptor concept is not aimed to be a strike at its original developers but is rather an example of how things can get out of hand when expanding an initial concept to cover more and more stuff. As far as legacy code is concerned this is still relatively tame and obviously the niceness of such scaffolding for the actual simulation is a side show when one first and foremost wants to generate new results.↩︎

i.e. each field describes its size as a function $f : \mathbb{N}_0^3 \to \mathbb{N}_0, (a,b,c) \mapsto a + b D + c Q$↩︎

See `src/dynamics/latticeDescriptors.h`↩︎

See `src/dynamics/descriptorBase.h`↩︎

e.g. there is also a `TENSOR` base template that encodes the size of a tensor of order `D` (which is not a linear function)↩︎

Common field definitions are collected in `src/dynamics/descriptorField.h`↩︎

See `src/dynamics/descriptorTag.h`↩︎

↩︎After all I am still primarily a mathematics student↩︎

I was surprised to learn how big of an advantage the Intel compiler can provide: In some settings the generated code runs up to 20 percent faster compared to what GCC or Clang produce.↩︎

As I previously alluded to, computational fluid dynamics is a current subject of interest of mine both academically^{1} and recreationally^{2}. Where on the academic side the focus obviously lies on theoretical strictness and simulations are only useful as far as their error can be judged and bounded, I very much like to take a more hand-wavy approach during my free time and just *fool around*. This works together nicely with my interest in GPU-based computation, which is to be the topic of this article.

While visualizations such as the one above are nice to behold in a purely aesthetic sense independent of any real world groundedness, their implementation is at least inspired by models of our physical reality. The next section aims to give an overview of such models for fluid flows and at least sketch out the theoretical foundation of the specific model implemented on the GPU to generate all visualizations we will see on this page.

The behaviour of weakly compressible fluid flows – i.e. non-supersonic flows where the compressibility of the flowing fluid plays a small but *non-central* role – is commonly modelled by the weakly compressible Navier-Stokes equations which relate density $\rho$, pressure $p$, viscosity $\nu$ and speed $u$ to each other:

$\begin{aligned} \partial_t \rho + \nabla \cdot (\rho u) &= 0 \\ \partial_t u + (u \cdot \nabla) u &= -\frac{1}{\rho} \nabla p + 2\nu\nabla \cdot \left(\frac{1}{2} (\nabla u + (\nabla u)^\top)\right)\end{aligned}$

As such the Navier-Stokes equations model a continuous fluid from a macroscopic perspective. That means that this model doesn’t concern itself with the inner workings of the fluid – e.g. what it is actually made of, how the specific molecules making up the fluid interact individually and so on – but rather considers it as an abstract vector field. One other way to model fluid flows is to explicitly model the individual fluid molecules using classical physics. This microscopic approach closely reflects what actually happens in reality. From this perspective the *flow* of the fluid is just an emergent property of the underlying individual physical interactions. Which approach one chooses for computational fluid dynamics depends on the question one wants to answer as well as the available computational resources. A sufficiently precise model of individual molecular interactions precisely models physical reality in arbitrary situations but is easily much more computationally intensive than a macroscopic approach using Navier-Stokes. In turn, solving such macroscopic equations can quickly become problematic in complex geometries with diverse boundary conditions. No model is perfect and no model is strictly better than any other model in all categories.

The approach I want to introduce for this article is neither macroscopic nor microscopic but situated between those two levels of abstraction – it is a *mesoscopic* approach to fluid dynamics. Such a model is given by the Boltzmann equations that can be used to describe fluids from a statistical perspective. As such the *Boltzmann-approach* is to model neither the macroscopic behavior of a fluid nor the microscopic particle interactions but the probability of a certain mass of fluid particles $f$ moving inside of an external force field $F$ with a certain directed speed $\xi$ at a certain spatial location $x$ at a specific time $t$:

$\left( \partial_t + \xi \cdot \partial_x + \frac{F}{\rho} \cdot \partial_\xi \right) f = \Omega(f) \left( = \partial_x f \cdot \frac{dx}{dt} + \partial_\xi f \cdot \frac{d\xi}{dt} + \partial_t f \right)$

The total differential $\Omega(f)$ of this Boltzmann advection equation can be viewed as a collision operator that describes the local redistribution of particle densities caused by said particles colliding. As this equation by itself is still continuous in all variables we need to discretize it in order to use it on a finite computer. This basically means that we restrict all variable values to a discrete and finite set in addition to replacing difficult to solve parts with more approachable approximations. Implementations of such a discretized Boltzmann equation are commonly referred to as the Lattice Boltzmann Method.

As our goal is to display simple fluid flows on a distinctly two dimensional screen, a first sensible restriction is to limit space to two dimensions^{3}. As a side note: At first glance this might seem strange as no truly 2D fluids exist in our 3D environment. While this doesn’t need to concern us for generating entertaining visuals there are in fact some real world situations where 2D fluid models can be reasonable solutions for 3D problems.

The lattice in LBM hints at the further restriction of our 2D spatial coordinate $x$ to a discrete lattice of points. The canonical way to structure such a lattice is to use a cartesian grid.

Besides the spatial restriction to a two dimensional lattice a common step of discretizing the Boltzmann equation is to approximate the collision operator using an operator pioneered by Bhatnagar, Gross and Krook:

$\Omega(f) := -\frac{f-f^\text{eq}}{\tau}$

This honorifically named BGK operator relaxes the current particle distribution $f$ towards its theoretical equilibrium distribution $f^\text{eq}$ at a rate $\tau$. The value of $\tau$ is one of the main control points for influencing the behaviour of the simulated fluid. e.g. its Reynolds number^{4} and viscosity are controlled using this parameter. Combining this definition of $\Omega(f)$ and the Boltzmann equation without external forces yields the BGK approximation of said equation:

$(\partial_t + \xi \cdot \nabla_x) f = -\frac{1}{\tau} (f(x,\xi,t) - f^\text{eq}(x,\xi,t))$

To further discretize this we restrict the velocity $\xi$ not just to two dimensions but to a finite set of nine discrete unit velocities (`D2Q9` - 2 dimensions, 9 directions):

$\newcommand{\V}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \{\xi_i\}_{i=0}^8 = \left\{ \V{0}{0}, \V{-1}{\phantom{-}1}, \V{-1}{\phantom{-}0}, \V{-1}{-1}, \V{\phantom{-}0}{-1}, \V{\phantom{-}1}{-1}, \V{1}{0}, \V{1}{1}, \V{0}{1} \right\}$

We also define the equilibrium $f^\text{eq}$ towards which all distributions in this model strive as the discrete equilibrium distribution by Maxwell and Boltzmann. This distribution $f_i^\text{eq}$ of the $i$-th discrete velocity $\xi_i$ is given for density $\rho \in \mathbb{R}_{\geq 0}$ and total velocity $u \in \mathbb{R}^2$ as well as fixed lattice weights $w_i$ and lattice speed of sound $c_s$:

$f_i^\text{eq} = w_i \rho \left( 1 + \frac{u \cdot \xi_i}{c_s^2} + \frac{(u \cdot \xi_i)^2}{2c_s^4} - \frac{u \cdot u}{2c_s^2} \right)$

The moments $\rho$ and $u$ at location $x$ are in turn dependent on the cumulated distributions:

$\begin{aligned}\rho(x,t) &= \sum_{i=0}^{q-1} f_i(x,t) \\ \rho u(x,t) &= \sum_{i=0}^{q-1} \xi_i f_i(x,t)\end{aligned}$

Verbosely determining the constant lattice weights and the lattice speed of sound would exceed the scope^{5} of this article. Generally these constants are chosen depending on the used set of discrete velocities in such a way that the resulting collision operator preserves both momentum and mass. Furthermore the operator should be independent of rotations.

$w_0 = \frac{4}{9}, \ w_{2,4,6,8} = \frac{1}{9}, \ w_{1,3,5,7} = \frac{1}{36}, \ c_s = \sqrt{1/3}$

We have now fully discretized the BGK approximation of the Boltzmann equation. As the actual solution to this equation is still implicit in its definition we need to solve the following definite integral of time and space:

$f_i(x+\xi_i, t+1) - f_i(x,t) = -\frac{1}{\tau} \int_0^1 (f_i(x+\xi_i s,t+s) - f_i^\text{eq}(x+\xi_i s, t+s)) ds$

Since the exact integration of this expression is actually non-trivial it is once again only approximated. While there are various ways of going about that, we can get away with using the common trapezoidal rule and the following shift of $f_i$ and $\tau$:

$\begin{aligned}\overline{f_i} &= f_i + \frac{1}{2\tau}(f_i - f_i^\text{eq}) \\ \overline\tau &= \tau + \frac{1}{2}\end{aligned}$

Thus we finally end up with a discrete LBM BGK equation that can be trivially performed – i.e. there is an explicit function for transforming the current state into its successor – on any available finite computer:

$\overline{f_i}(x+\xi_i,t+1) = \overline{f_i}(x,t) - \frac{1}{\overline\tau} (\overline{f_i}(x,t) - f_i^\text{eq}(x,t))$

Note that on an infinite or periodic (e.g. toroidal) lattice this equation defines all distributions in every lattice cell. If we are confronted with more complex situations such as borders where the fluid is reflected or open boundaries where mass enters or leaves the simulation domain we need special boundary conditions to model the missing distributions. Boundary conditions are also one of the big subtopics in LBM theory as there isn’t one condition to rule them all but a plethora of different boundary conditions with their own ups and downsides.

The ubiquitous way of applying the discrete LBM equation to a lattice is to separate it into a two step *Collide-and-Stream* process:

$\begin{aligned}f_i^\text{out}(x,t) &:= f_i(x,t) - \frac{1}{\tau}(f_i(x,t) - f_i^\text{eq}(x,t)) &&\text{(Collide)} \\ f_i(x+\xi_i,t+1) &:= f_i^\text{out}(x,t) &&\text{(Stream)}\end{aligned}$

Closer inspection of this process reveals one of the advantages of LBM driven fluid dynamics: They positively beg for parallelization. While the collision step is embarrassingly parallel due to its fully cell-local nature even the stream step only communicates with the cell’s direct neighbors.

One might note that the values of our actual distributions $f_i$ are – contrary to the stated goal of the previous section – still unrestricted, non-discrete and unbounded real numbers. Their discretization happens implicitly by choosing the floating point type used by our program. In the case of the following compute shaders all these values will be encoded as 4-byte single-precision floating point numbers as is standard for GPU code.

To implement an LBM using compute shaders we need to represent the lattice in the GPU’s memory. Each lattice cell requires nine 4-byte floating point numbers to describe its distribution. This means that in 2D the lattice memory requirement by itself is fairly negligible as e.g. a lattice resolution of `1024x1024` fits within 36 MiB and thus takes up only a small fraction of the onboard memory provided by current GPUs. In fact GPU memory and processors are fast enough that we do not really have to concern ourselves with detailed optimizations^{6} if we only want to visualize a reasonably sized lattice with a reasonable count of lattice updates per second – e.g. 50 updates per second on a `256x256` lattice do not require^{7} any thoughts on optimization whatsoever on the Nvidia K2200 employed by my workstation.

Despite all actual computation happening on the GPU we still need some CPU-based wrapper code to interact with the operating system, initialize memory, control the OpenGL state machine and so on. While I could not find any suitable non-gaming targeted C++ library to ease development of this code the scaffolding originally written^{8} for my vector field visualization computicle was easily adapted to this new application.

To further simplify the implementation of our GLSL stream kernel we can use the abundant GPU memory to store two full states of the lattice. This allows for updating the cell populations of the upcoming collide operation without overwriting the current collision result which in turn means that the execution sequence of the stream kernel doesn’t matter.

So all in all we require three memory regions: A collision buffer for performing the collide step, a streaming buffer as the streaming target and a fluid buffer to store velocity and pressure for visualization purposes. As an example we can take a look at how the underlying lattice buffer for collide and stream is allocated on the GPU:

```cpp
LatticeCellBuffer::LatticeCellBuffer(GLuint nX, GLuint nY) {
  glGenVertexArrays(1, &_array);
  glGenBuffers(1, &_buffer);

  const std::vector<GLfloat> data(9*nX*nY, GLfloat{1./9.});

  glBindVertexArray(_array);
  glBindBuffer(GL_ARRAY_BUFFER, _buffer);
  glBufferData(
    GL_ARRAY_BUFFER,
    data.size() * sizeof(GLfloat),
    data.data(),
    GL_DYNAMIC_DRAW
  );

  glEnableVertexAttribArray(0);
  glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
}
```

We can use the resulting `_buffer` address of type `GLuint` to bind the data array to corresponding binding points inside the compute shader. In our case these binding points are defined as follows:

```glsl
layout (local_size_x = 1, local_size_y = 1) in;

layout (std430, binding=1) buffer bufferCollide { float collideCells[]; };
layout (std430, binding=2) buffer bufferStream  { float streamCells[];  };
layout (std430, binding=3) buffer bufferFluid   { float fluidCells[];   };

uniform uint nX;
uniform uint nY;
```

Calling compute shaders of this signature from the CPU is nicely abstracted by some computicle-derived^{9} wrapper classes such as `ComputeShader`:

```cpp
// vector of buffer addresses to be bound
auto buffers = {
  lattice_a->getBuffer(),
  lattice_b->getBuffer(),
  fluid->getBuffer()
};

// bind buffers for the shaders to work on
collide_shader->workOn(buffers);
stream_shader->workOn(buffers);

// activate and trigger compute shaders
{
  auto guard = collide_shader->use();
  collide_shader->dispatch(nX, nY);
}
{
  auto guard = stream_shader->use();
  stream_shader->dispatch(nX, nY);
}
```

Lattice constants can be stored directly in the shader:

```glsl
const uint q = 9;

const float weight[q] = float[](
  1./36., 1./9., 1./36.,
  1./9. , 4./9., 1./9. ,
  1./36., 1./9., 1./36.
);

const float tau   = 0.8;
const float omega = 1/tau;
```

Manual indexing to mimic multidimensional arrays allows for flexible memory layouts while preserving reasonably easy access:

```glsl
uint indexOfDirection(int i, int j) {
  return 3*(j+1) + (i+1);
}

uint indexOfLatticeCell(uint x, uint y) {
  return q*nX*y + q*x;
}

/* [...] */

float w(int i, int j) {
  return weight[indexOfDirection(i,j)];
}

float get(uint x, uint y, int i, int j) {
  return collideCells[indexOfLatticeCell(x,y) + indexOfDirection(i,j)];
}
```

The discrete equilibrium distribution $f_i^\text{eq}$ is expressed as a single line of code when aided by some convenience functions such as `comp` for the dot product of discrete velocity $\xi_i$ and velocity moment $u$:

```glsl
float equilibrium(float d, vec2 u, int i, int j) {
  return w(i,j)
       * d
       * (1 + 3*comp(i,j,u) + 4.5*sq(comp(i,j,u)) - 1.5*sq(norm(u)));
}
```

Our actual collide kernel `collide.glsl` is compactly expressed as an iteration over all discrete velocities and a direct codification of the collision formula:

```glsl
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

const float d = density(x,y);
const vec2  v = velocity(x,y,d);

setFluid(x,y,v,d);

for ( int i = -1; i <= 1; ++i ) {
  for ( int j = -1; j <= 1; ++j ) {
    set(x,y,i,j,
        get(x,y,i,j) + omega * (equilibrium(d,v,i,j) - get(x,y,i,j)));
  }
}
```

The streaming kernel `stream.glsl` turns out to be equally compact even when a basic bounce back boundary condition is included. Such a condition simply reflects the populations that would be streamed outside the fluid domain to define the – otherwise undefined – populations pointing towards the fluid.

```glsl
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

if ( x != 0 && x != nX-1 && y != 0 && y != nY-1 ) {
  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      set(x+i,y+j,i,j, get(x,y,i,j));
    }
  }
} else {
  // rudimentary bounce back boundary handling
  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      if ( (x > 0 || i >= 0) && x+i <= nX-1 &&
           (y > 0 || j >= 0) && y+j <= nY-1 ) {
        set(x+i,y+j,i,j, get(x,y,i,j));
      } else {
        set(x,y,i*(-1),j*(-1), get(x,y,i,j));
      }
    }
  }
}
```

We can now use the two compute shaders to simulate 2D fluids on the GPU. Sadly we are still missing some way to display the results on our screen so we will not see anything. Luckily all data required to amend this situation already resides on the GPU’s memory within easy reach of video output.

The vertex array containing the fluid’s moments encoded in a 3D vector we wrote to during every collision can be easily passed to a graphic shader:

```cpp
auto guard = scene_shader->use();

// pass projection matrix MVP and lattice dimensions
scene_shader->setUniform("MVP", MVP);
scene_shader->setUniform("nX", nX);
scene_shader->setUniform("nY", nY);

// draw to screen
glClear(GL_COLOR_BUFFER_BIT);
glBindVertexArray(fluid_array);
glDrawArrays(GL_POINTS, 0, _nX*_nY);
```

In this case the graphic shader consists of three stages: A vertex shader to place the implicitly positioned fluid vertices in screen space, a geometry shader to transform point vertices into quads to be colored and a fragment shader to apply the coloring.

```glsl
const vec2 idx = fluidVertexAtIndex(gl_VertexID);

gl_Position = vec4(
  idx.x - nX/2,
  idx.y - nY/2,
  0.,
  1.
);

vs_out.color = mix(
  vec3(-0.5, 0.0, 1.0),
  vec3( 1.0, 0.0, 0.0),
  displayAmplifier * VertexPosition.z * norm(VertexPosition.xy)
);
```

This extract of the first `vertex.glsl` stage reverses the implicit positioning by array index to recover the actual spatial location of the fluid cells and mixes the color scheme for displaying the velocity norm weighted by its density.

```glsl
layout (points) in;
layout (triangle_strip, max_vertices=4) out;

uniform mat4 MVP;

in VS_OUT {
  vec3 color;
} gs_in[];

out vec3 color;

vec4 project(vec4 v) {
  return MVP * v;
}

void emitSquareAt(vec4 position) {
  const float size = 0.5;
  gl_Position = project(position + vec4(-size, -size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4( size, -size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4(-size,  size, 0.0, 0.0));
  EmitVertex();
  gl_Position = project(position + vec4( size,  size, 0.0, 0.0));
  EmitVertex();
}

void main() {
  color = gs_in[0].color;
  emitSquareAt(gl_in[0].gl_Position);
  EndPrimitive();
}
```

`geometry.glsl` projects these fluid cells, which were up until now positioned in lattice space, into the screen's coordinate system via the `MVP` matrix. Such geometry shaders are very flexible as they allow us to easily adapt a fixed point-vertex-based shader interface to different visualization geometries.

This more abstract visualization embedded in its moving glory at the start of this article was generated in the same way by simply spatially shifting the fluid cells by their heavily amplified velocities instead of only coloring them.

As we are displaying a simulated universe for pure entertainment purposes we have *some* leeway in what laws we enforce. So while in practical simulations we would have to carefully handle any external influences to enforce e.g. mass preservation, on our playground nobody prevents us from simply dumping energy into the system at the literal twitch of a finger:

Even though this interactive ~~sand~~fluidbox is as simple as it gets, everyone who has ever played around with falling sand games in the vein of Powder Toy will know how much fun such contained physical models can be. Starting from the LBM code developed in this article it is but a small step to add mouse-based interaction. In fact the most complex part is transforming the on-screen mouse coordinates into lattice space to identify the nodes where density has to be added during collision equilibration. The actual external intervention into our lattice state is trivial:

```glsl
float getExternalPressureInflux(uint x, uint y) {
  if ( mouseState == 1 && norm(vec2(x,y) - mousePos) < 4 ) {
    return 1.5;
  } else {
    return 0.0;
  }
}

/* [...] */

void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;

  const float d = max(getExternalPressureInflux(x,y), density(x,y));
  const vec2  v = velocity(x,y,d);

  setFluid(x,y,v,d);

  for ( int i = -1; i <= 1; ++i ) {
    for ( int j = -1; j <= 1; ++j ) {
      set( x,y,i,j,
           get(x,y,i,j) + omega * (equilibrium(d,v,i,j) - get(x,y,i,j)) );
    }
  }
}
```
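The screen-to-lattice transform mentioned above boils down to a rescaling plus a flip of the y axis, since window coordinates commonly grow downwards while our lattice rows grow upwards. A minimal CPU-side sketch of this idea - the names, axis convention and clamping are my own assumptions, not the project's actual interface:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mapping a window-space mouse position onto a
// lattice cell: scale to lattice dimensions, flip the y axis and
// clamp to the valid cell range.
struct LatticePos { int x, y; };

LatticePos mouseToLattice(double mouseX, double mouseY,
                          int windowW, int windowH,
                          int nX, int nY) {
  const int x = std::clamp(
    static_cast<int>(mouseX / windowW * nX), 0, nX - 1);
  const int y = std::clamp(
    static_cast<int>((1.0 - mouseY / windowH) * nY), 0, nY - 1);
  return { x, y };
}
```

A mouse position in the center of the window then maps to the center of the lattice, and positions on the window border are clamped to the boundary cells.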

As usual the full project summarized in this article is available on cgit. Lattice Boltzmann Methods are a very interesting approach to modelling fluids on a computer and I hope that the initial theory-heavy section did not completely hide how compact the actual implementation is compared to the generated results. Especially if one doesn’t care for accuracy compared to reality it is very easy to write basic LBM codes and play around in the supremely entertaining field of computational fluid dynamics. Should you be looking for a more serious framework that is actually usable for productive simulations do not hesitate to check out OpenLB, Palabos or waLBerla.

i.e. I’ve now been a student employee of the Lattice Boltzmann Research Group for two years where I contribute to the open source LBM framework OpenLB. Back in 2017 I was granted the opportunity to attend the LBM Spring School in Tunisia. In addition to that I am currently writing my bachelor’s thesis on grid refinement in LBM using OpenLB.↩︎

e.g. boltzbub, compustream, this article.↩︎

Of course the Lattice Boltzmann Method works equally well in three dimensions.↩︎

Dimensionless ratio of inertial to viscous forces, $\mathrm{Re} = \frac{U L}{\nu}$ for characteristic velocity $U$, characteristic length $L$ and kinematic viscosity $\nu$. The Reynolds number is essential for linking the lattice-based simulation to physical models. LBM simulations tend to be harder to control the higher the Reynolds number becomes - i.e. the more *liquid* and thus turbulent the fluid is. For further details see e.g. Chapter 7 *Non-dimensionalisation and Choice of Simulation Parameters* of the book linked right below.↩︎

If you want to know more about all the gritty details I can recommend The Lattice Boltzmann Method: Principles and Practice by Krüger et al.↩︎

e.g. laying out the memory to suit the GPU’s cache structure, optimizing instruction sequence and so on↩︎

i.e. the code runs without causing any mentionable GPU load as reported by the handy nvtop performance monitor↩︎

So I recently acquired a reasonably priced second-hand CAD workstation featuring a Xeon CPU, plenty of RAM and - as the heart of the matter - a nice Nvidia K2200 GPU with 4 GiB of memory and 640 cores. The plan was that this would enable me to realize my long-held plans of diving into GPU programming - specifically using compute shaders to implement mathematical simulation type stuff. True to my previously described inclination to procrastinate on interesting projects by delving into other interesting topics, my first step towards realizing this plan was of course acquainting myself with a new Linux distribution: NixOS.

After weeks of configuring I am now in the position of working inside a fully reproducible environment declaratively described by a set of version controlled text files^{1}. The main benefit of this is that my project-specific development environments are now easily portable and consistent across all my machines: spending the morning working on something on the workstation and continuing said work on the laptop between lectures in the afternoon is as easy as syncing the Nix environments. This is in turn easily achieved by including the corresponding `shell.nix` files in the project's repository.

Consider for example the environment I use to generate this very website, declaratively described in the Nix language:

```nix
with import <nixpkgs> {};

stdenv.mkDerivation rec {
  name = "blog-env";
  env  = buildEnv {
    name  = name;
    paths = buildInputs;
  };

  buildInputs = let
    generate = pkgs.callPackage ./pkgs/generate.nix {};
    preview  = pkgs.callPackage ./pkgs/preview.nix {};
    katex    = pkgs.callPackage ./pkgs/KaTeX.nix {};
  in [
    generate
    preview
    pandoc
    highlight
    katex
  ];
}
```

Using this `shell.nix` file the blog can be generated by my mostly custom XSLT-based setup^{2} by issuing a simple `nix-shell --command "generate"` in the repository root. All dependencies - be it pandoc for markup transformation, a custom KaTeX wrapper for server-side math expression typesetting or my very own InputXSLT - will be fetched and compiled as necessary by Nix.

```nix
{ stdenv, fetchFromGitHub, cmake, boost, xalanc, xercesc, discount }:

stdenv.mkDerivation rec {
  name = "InputXSLT";

  src = fetchFromGitHub {
    owner  = "KnairdA";
    repo   = "InputXSLT";
    rev    = "master";
    sha256 = "1j9fld3sh1jyscnsx6ab9jn5x6q67rjh9p3bgsh5na1qrs40dql0";
  };

  buildInputs = [ cmake boost xalanc xercesc discount ];

  meta = with stdenv.lib; {
    description = "InputXSLT";
    homepage    = https://github.com/KnairdA/InputXSLT/;
    license     = stdenv.lib.licenses.asl20;
  };
}
```

This will work on any system where the Nix package manager is installed, without any further manual intervention by the user. So where in the past I had to manually ensure that all dependencies were available - which included compiling and installing my custom site generator stack - I can now simply clone the repository and generate the website with a single command^{3}.

It cannot be overstated how powerful the system management paradigm implemented by Nix and NixOS is. On NixOS I am finally free to try out anything I desire without fear of polluting my system and creating an unmaintainable mess, as everything can be isolated and garbage collected when I don't need it anymore. Sure, it is some additional effort to maintain Nix environments and write a custom derivation here and there for software that is not yet available^{4} in nixpkgs, but when your program works or your project compiles you can be sure that it does so because the system is configured correctly and all dependencies are accounted for - nothing works by accident^{5}.

Note that the `nix-shell`-based example presented above is only a small subset of what NixOS offers. Besides shell environments the whole system configuration - systemd services, the networking setup, my user GUI environment and so on - is also expressed in the Nix language. i.e. the whole system from top to bottom is declaratively described in a consistent fashion.

NixOS is the first distribution I am truly excited about since my initial stint of distro-hopping when I first got into Linux a decade ago. Its declarative package manager and configuration model is true innovation and one of those rare things where you already know that you will never go back to the old way of doing things after barely catching a glimpse of it. Sure, other distros can be nice and I greatly enjoyed my nights of compiling Gentoo as well as years spent tinkering with my ArchLinux systems, but NixOS offers something truly distinct and incredibly useful. At first I thought about using the Nix and Scheme based GuixSD distribution instead, but I got used to the Nix language quickly and do not think that the switch to Guile Scheme as the configuration language adds enough to offset having to deal with GNU's free software fundamentalism^{6}.

Of course I was not satisfied with merely porting my workflows onto a new, superior distribution but also had to switch from i3 to XMonad in the same breath. By streamlining my tiling window setup on top of this Haskell-based window manager my setup has reached a new level of minimalism. Layouts are now restricted to either fullscreen, tabbed or simple side-by-side tiling and everything is controlled using Rofi instances and keybindings. My constant need to check battery level, fan speed and system performance was fixed by removing all bars and showing only minimally styled windows. And due to the reproducibility^{7} of NixOS the interested reader can check out the full system themselves if they so desire! :-) See the home-manager based user environment or specifically the XMonad config for further details.

After getting settled in this new working environment I finally was out of distractions and moved on to my original wish of familiarizing myself with delegating non-graphical work to the GPU. The first presentable result of this undertaking is my minimalistic fieldplay clone computicle.

What computicle does is simulate many particles moving according to a vector field described by a function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that is interpreted as an ordinary differential equation to be solved using the classical Runge-Kutta method. As this problem translates into many similar calculations performed per particle without any communication between particles, it is an ideal candidate for massive parallelization using GLSL compute shaders on the GPU.

```glsl
#version 430

layout (local_size_x = 1) in;
layout (std430, binding=1) buffer bufferA { float data[]; };

vec2 f(vec2 v) {
  return vec2(
    cos(v.x*sin(v.y)),
    sin(v.x-v.y)
  );
}

vec2 classicalRungeKutta(float h, vec2 v) {
  const vec2 k1 = f(v);
  const vec2 k2 = f(v + h/2. * k1);
  const vec2 k3 = f(v + h/2. * k2);
  const vec2 k4 = f(v + h    * k3);

  return v + h * (1./6.*k1 + 1./3.*k2 + 1./3.*k3 + 1./6.*k4);
}

[...]

void main() {
  const uint i = 3*gl_GlobalInvocationID.x;

  const vec2 v = vec2(data[i+0], data[i+1]);
  const vec2 w = classicalRungeKutta(0.01, v);

  data[i+0] = w.x;   // particle x position
  data[i+1] = w.y;   // particle y position
  data[i+2] += 0.01; // particle age
}
```

As illustrated by this simplified extract of computicle's compute shader, writing code for the GPU can look and feel quite similar to targeting the CPU in the C language. Fittingly, my main gripes during development were not with the GPU code itself but rather with the surrounding C++ code required to pass the data back and forth and talk to the OpenGL state machine in a sensible manner.
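To make that similarity concrete, the same classical Runge-Kutta step translates almost line for line into C++. The following is my own illustration, not code from computicle; the tiny `vec2` struct merely stands in for GLSL's built-in vector type:

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for GLSL's vec2 so the kernel reads the same on the CPU.
struct vec2 {
  float x, y;
  vec2 operator+(vec2 o) const { return { x + o.x, y + o.y }; }
  vec2 operator*(float s) const { return { x * s, y * s }; }
};
vec2 operator*(float s, vec2 v) { return v * s; }

// The same example vector field as in the compute shader.
vec2 f(vec2 v) {
  return { std::cos(v.x * std::sin(v.y)), std::sin(v.x - v.y) };
}

// Classical fourth-order Runge-Kutta step, mirroring the GLSL version.
vec2 classicalRungeKutta(float h, vec2 v) {
  const vec2 k1 = f(v);
  const vec2 k2 = f(v + h/2.f * k1);
  const vec2 k3 = f(v + h/2.f * k2);
  const vec2 k4 = f(v + h     * k3);

  return v + h * (1.f/6.f*k1 + 1.f/3.f*k2 + 1.f/3.f*k3 + 1.f/6.f*k4);
}
```

The only real differences are the float literal suffixes and the hand-rolled operators - the numerical logic is untouched, which is exactly why porting such kernels between CPU and GPU is so pleasant.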

The first issue was how to include GLSL shader source in my C++ application. While the way OpenGL accepts shaders as raw strings and compiles them for the GPU on the fly is not without benefits (e.g. switching between shaders generated at runtime is trivial), it can quickly turn ugly and doesn't feel well integrated into the overall language. Reading shader source from text files at runtime was not the way I wanted to go either, as this would feel even more clunky and fragile. What I settled on - until the committee comes through with something like `std::embed` - is to include the shader source as multi-line string literals stored in static constants placed in separate headers. This *works* for now and at least offers syntax highlighting in terms of editor support.
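A minimal sketch of this approach - the namespace, constant name and GLSL snippet are illustrative placeholders, not computicle's actual shader headers:

```cpp
#include <cassert>
#include <string>

// Sketch: the shader source lives in a static constant (in the real
// project placed in a separate header) as a C++11 raw string literal,
// which keeps the GLSL readable and lets editors highlight it. Nothing
// is compiled for the GPU here - the string is only handed to
// glShaderSource at runtime.
namespace shader {

const std::string vertex = R"(
#version 430
uniform uint nX;
uniform uint nY;
void main() {
  // [...] actual vertex logic lives here
}
)";

}
```

At runtime one would pass `shader::vertex.c_str()` to the usual `glShaderSource` / `glCompileShader` calls.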

What would be really nice is if the shaders could be generated from a domain-specific language and statically verified at compile time. Such a solution could also offer unified tools for handling uniform variables and data buffer bindings. While something like that doesn't seem to be available for C++^{8}, I stumbled upon the very interesting LambdaCube 3D and varjo projects. The former promises to become a Haskell-like purely functional language for GPU programming and the latter is an interesting GLSL-generating framework for Lisp.

I also could not find a nice and reasonably lightweight library for interfacing with the OpenGL API in a modern fashion, so I ended up creating my own scope-guard type wrappers around the OpenGL functionality required by computicle. While the result looks nice, it is probably of limited portability to other applications.
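The core idea behind such wrappers can be sketched generically: `use()` performs the binding and returns a guard object whose destructor undoes it when the scope ends. In this self-contained illustration the actual GL calls are replaced by a logging mock - the real wrappers would call e.g. `glUseProgram` or `glBindFramebuffer` instead; all names here are my own, not computicle's API:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Generic RAII scope guard: runs the given callable on destruction.
class ScopeGuard {
public:
  explicit ScopeGuard(std::function<void()> onExit):
    _onExit(std::move(onExit)) { }
  ScopeGuard(ScopeGuard&& other):
    _onExit(std::move(other._onExit)) {
    other._onExit = [](){}; // moved-from guard becomes a no-op
  }
  ~ScopeGuard() { _onExit(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
  std::function<void()> _onExit;
};

// Mock shader that logs bind/unbind instead of talking to OpenGL.
struct MockShader {
  std::vector<std::string>& log;
  ScopeGuard use() {
    log.push_back("bind");
    return ScopeGuard([this]() { log.push_back("unbind"); });
  }
};
```

Used as `auto guard = shader.use();`, the binding is guaranteed to be released at the end of the enclosing block - exactly the pattern visible in the update loop below.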

```cpp
// simplified extract of computicle's update loop
window.render([&]() {
  [...]
  if ( timer::millisecondsSince(last_frame) >= 1000/max_ups ) {
    auto guard = compute_shader->use();

    compute_shader->setUniform("world", world_width, world_height);
    compute_shader->dispatch(particle_count);

    last_frame = timer::now();
  }
  [...]
  {
    auto texGuard = texture_framebuffers[0]->use();
    auto sdrGuard = scene_shader->use();

    scene_shader->setUniform("MVP", MVP);
    [...]
    particle_buffer->draw();
  }
  {
    auto guard = display_shader->use();

    display_shader->setUniform("screen_textures",      textures);
    display_shader->setUniform("screen_textures_size", textures.size());

    glClear(GL_COLOR_BUFFER_BIT);
    display_buffer->draw(textures);
  }
});
```

One idea that I am currently toying with in respect to my future GPU-based projects is to abandon C++ as the host language and instead use a more flexible^{9} language such as Scheme or Haskell for generating the shader code and communicating with the GPU. This could work out well as the performance of CPU code doesn’t matter as much when the bulk of the work is performed by shaders. At least this is the impression I got from my field visualization experiment - the CPU load was minimal independent of how many kiloparticles were simulated.

See nixos_system and nixos_home↩︎

See the summary node or Expanding XSLT using Xalan and C++↩︎

And this works on all my systems, including my Surface 4 tablet where I installed Nix on top of Debian running in WSL↩︎

Which is not a big problem in practice as the repository already provides a vast set of software and builders for many common build systems and adapters for language specific package managers. For example my Vim configuration including plugin management is also handled by Nix. The clunky custom texlive installation I maintained on my ArchLinux system was replaced by nice, self-contained shell environments that only provide the $\LaTeX$ packages that are actually needed for the document at hand.↩︎

At least if you are careful about what is installed imperatively using `nix-env` or if you use the `--pure` flag in `nix-shell`↩︎

Which I admire greatly - but I also want to use the full power of my GPU and run proprietary software when necessary↩︎

And the system really is fully reproducible: I now tested this two times, once when moving the experimental setup onto a new SSD and once when installing the workstation config on my laptop. Each time I was up and running with the full configuration as I left it in under half an hour. Where before NixOS a full system failure would have incurred days of restoring backups, reconstructing my specific configuration and reinstalling software I can now be confident that I can be up and running on a replacement machine simply by cloning a couple of repositories and restoring a home directory backup.↩︎

At least when one wants to work with compute shaders - I am sure there are solutions in this direction for handling graphic shaders for gaming and CAD type stuff.↩︎

Flexible as in better support for domain-specific languages↩︎