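Compute Shaders#

For certain types of calculations, compute shaders on the GPU can be thousands of times faster than running on the CPU alone.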

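In this tutorial, we will simulate a star field using an "N-Body simulation". Each star is affected by the gravity of every other star. For 1,000 stars, this means 1,000 x 1,000 = 1,000,000 calculations per frame. The video has 65,000 stars, which requires roughly 4.2 billion gravity force calculations per frame. On high-end hardware, it can still run at 60 fps!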
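
As a quick check of the arithmetic above, the number of force evaluations grows with the square of the star count. The sketch below assumes every star visits every star, including itself; the helper name `pair_evaluations` is illustrative and not part of the tutorial's code.

```python
def pair_evaluations(num_stars: int) -> int:
    """Force evaluations per frame when every star visits every star."""
    return num_stars * num_stars


for count in (1_000, 65_000):
    print(f"{count:,} stars -> {pair_evaluations(count):,} evaluations per frame")
```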

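How does this work? The program has three major parts:

The Python code, which allocates buffers and glues everything together
The visualization shaders, which let us see the data in the buffers
The compute shader, which moves everything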

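Buffers#

We need a place to store the data we will visualize. To do so, we will create two Shader Storage Buffer Objects (SSBOs) of floating point numbers from within our Python code. One will hold the star positions from the previous frame, and the other will store the positions calculated for the next frame.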

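Each buffer must be able to store the following for each star:

The x, y, and radius of the star
The velocity of the star, which is not used by the visualization
The floating point RGBA color of the star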
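
To make the storage cost concrete, here is a minimal sketch of the per-star layout. It assumes 12 floats per star, the padded layout used later in this tutorial. The constant and function names are illustrative only.

```python
from array import array

# Assumed layout per star: 4 floats position/radius, 4 velocity, 4 color
FLOATS_PER_STAR = 12
BYTES_PER_FLOAT = array('f').itemsize  # 4 bytes for a C float
STAR_STRIDE_BYTES = FLOATS_PER_STAR * BYTES_PER_FLOAT


def buffer_size_bytes(num_stars: int) -> int:
    """Size of one SSBO holding num_stars padded star records."""
    return num_stars * STAR_STRIDE_BYTES


print(STAR_STRIDE_BYTES, buffer_size_bytes(4000))
```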

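Generating Aligned Data#

To avoid issues with GPU memory alignment, we will use the function below to generate well-aligned data ready to load into the SSBO. The docstrings and comments explain why in greater detail:

Generating well-aligned data to load onto the GPU#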

def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    :param screen_size: A (width, height) of the area to generate stars in
    :param num_stars: How many stars to generate
    :param use_color: Whether to generate white or randomized pastel stars
    :return: an array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())

Allocating the Buffers#

Allocating the buffers & loading the data onto the GPU#

        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
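
The format string above should describe exactly one 12-float star record. The sketch below checks this by summing token sizes. It is not Arcade's own parser: the function name is hypothetical, and it assumes, based on the comments in the listing, that `4f` means four 4-byte floats and `4x4` means four padding units of 4 bytes each.

```python
import re


def format_stride_bytes(buffer_format: str) -> int:
    """Sum the byte size of each token in a format string such as "4f 4x4 4f"."""
    total = 0
    for token in buffer_format.split():
        match = re.fullmatch(r"(\d+)([fx])(\d*)", token)
        if match is None:
            raise ValueError(f"unsupported token: {token!r}")
        count, _kind, size = match.groups()
        # Assumption: a missing size suffix means 4 bytes per component
        total += int(count) * (int(size) if size else 4)
    return total


print(format_stride_bytes("4f 4x4 4f"))
```

If the result differs from the size of one star record in the array, the vertex attributes would read from the wrong offsets.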

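Visualization Shaders#

Now that we have the data, we need to be able to visualize it. We will do so by applying vertex, geometry, and fragment shaders to convert the data in the SSBO into pixels. Each star's 12 floats in the array pass through these three stages in that order.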

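Vertex Shader#

In this tutorial, the vertex shader runs once for each star's 12-float stretch of raw padded data in self.ssbo_current. Each execution outputs clean, typed data to an instance of the geometry shader.

The data is read in as follows:

The x, y, and radius of each star, via in_vertex
The floating point RGBA color of the star, via in_color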

shaders/vertex_shader.glsl#

#version 330

in vec4 in_vertex;
in vec4 in_color;

out vec2 vertex_pos;
out float vertex_radius;
out vec4 vertex_color;

void main()
{
    vertex_pos = in_vertex.xy;
    vertex_radius = in_vertex.w;
    vertex_color = in_color;
}

The variables below are then passed as inputs to the geometry shader:

vertex_pos
vertex_radius
vertex_color
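
For readers more comfortable with Python, the sketch below mimics on the CPU what the vertex shader does with one star's 12 floats. It is only an illustration: the function and class names are made up here, and on the GPU the slicing is done by the buffer format string rather than by index arithmetic.

```python
from typing import NamedTuple, Sequence, Tuple

FLOATS_PER_STAR = 12  # position/radius, velocity, color (4 floats each)


class VertexOutput(NamedTuple):
    vertex_pos: Tuple[float, float]
    vertex_radius: float
    vertex_color: Tuple[float, float, float, float]


def emulate_vertex_shader(data: Sequence[float], star_index: int) -> VertexOutput:
    """Pick in_vertex and in_color out of the padded data, skipping velocity."""
    base = star_index * FLOATS_PER_STAR
    in_vertex = tuple(data[base:base + 4])       # x, y, z (padding), radius
    in_color = tuple(data[base + 8:base + 12])   # r, g, b, a
    return VertexOutput(in_vertex[:2], in_vertex[3], in_color)


sample = [10.0, 20.0, 0.0, 6.0,  0.0, 0.0, 0.0, 0.0,  0.5, 0.75, 1.0, 1.0]
print(emulate_vertex_shader(sample, 0))
```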

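Geometry Shader#

The geometry shader converts each single point into a quad, in this case a square, which the GPU can render. It does this by emitting four points centered on the input point.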

shaders/geometry_shader.glsl#

#version 330

layout (points) in;
layout (triangle_strip, max_vertices = 4) out;

// Use arcade's global projection UBO
uniform Projection {
    uniform mat4 matrix;
} proj;


// The outputs from the vertex shader are used as inputs
in vec2 vertex_pos[];
in float vertex_radius[];
in vec4 vertex_color[];

// These are used with EmitVertex to generate four points of
// a quad centered around vertex_pos[0].
out vec2 g_uv;
out vec3 g_color;

void main() {
    vec2 center = vertex_pos[0];
    vec2 hsize = vec2(vertex_radius[0]);

    g_color = vertex_color[0].rgb;

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 0);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 0);
    EmitVertex();

    // End geometry emission
    EndPrimitive();
}
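
The corner arithmetic in the geometry shader can be sketched in Python as follows. This is an illustration under simplifying assumptions: it leaves out the projection matrix, and `quad_corners` is a hypothetical helper name. The corners are listed in the same order the shader emits them for its triangle strip.

```python
from typing import List, Tuple

Corner = Tuple[Tuple[float, float], Tuple[int, int]]


def quad_corners(center: Tuple[float, float], radius: float) -> List[Corner]:
    """Return (position, uv) pairs for a square centered on a star."""
    cx, cy = center
    return [
        ((cx - radius, cy + radius), (0, 1)),  # -x, +y
        ((cx - radius, cy - radius), (0, 0)),  # -x, -y
        ((cx + radius, cy + radius), (1, 1)),  # +x, +y
        ((cx + radius, cy - radius), (1, 0)),  # +x, -y
    ]


for position, uv in quad_corners((10.0, 20.0), 6.0):
    print(position, uv)
```

Each side of the square is twice the radius, so a star with radius 6.0 covers a 12 x 12 area before projection.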

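Fragment Shader#

A fragment shader runs for each pixel in a quad. It converts a UV coordinate within the quad into a floating point RGBA value. In this tutorial, the shader produces the soft glowing circle on the surface of each star's quad.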

shaders/fragment_shader.glsl#

#version 330

in vec2 g_uv;
in vec3 g_color;

out vec4 out_color;

void main()
{
    float l = length(vec2(0.5, 0.5) - g_uv.xy);
    if ( l > 0.5)
    {
        discard;
    }
    float alpha;
    if (l == 0.0)
        alpha = 1.0;
    else
        alpha = min(1.0, .60-l * 2);

    vec3 c = g_color.rgb;
    // c.xy += v_uv.xy * 0.05;
    // c.xy += v_pos.xy * 0.75;
    out_color = vec4(c, alpha);
}
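
To see how the glow falls off, the alpha calculation can be reproduced on the CPU. The sketch below mirrors the shader's arithmetic; `star_alpha` is an illustrative name, and `None` stands in for `discard`. Values are returned unclamped, so they go negative once the distance from the center exceeds 0.3.

```python
import math
from typing import Optional


def star_alpha(u: float, v: float) -> Optional[float]:
    """Alpha for a UV coordinate inside a star's quad, or None if discarded."""
    dist = math.hypot(0.5 - u, 0.5 - v)  # distance from the quad's center
    if dist > 0.5:
        return None  # outside the circle: the shader discards the fragment
    if dist == 0.0:
        return 1.0
    return min(1.0, 0.60 - dist * 2)


for uv in ((0.5, 0.5), (0.5, 0.6), (0.5, 0.9), (0.0, 0.0)):
    print(uv, star_alpha(*uv))
```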

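Compute Shader#

Now that we have a way to display data, we should update it.

We created pairs of buffers earlier. We will use one SSBO as an input buffer holding the previous frame's data, and the other as the output buffer we write results to.

We then swap the buffers each frame after drawing, using the output as the input of the next frame, and repeat the process until the program stops running.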
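
This swapping pattern is often called ping-pong buffering. A minimal sketch with plain Python lists, assuming a stand-in `step` function in place of the compute shader, looks like this:

```python
from typing import List, Tuple


def step(previous: List[float], current: List[float]) -> None:
    """Stand-in for the compute shader: read from previous, write to current."""
    for i, value in enumerate(previous):
        current[i] = value + 1.0


def run_frames(frames: int) -> Tuple[List[float], List[float]]:
    # Both buffers start with the same initial data, as in the tutorial
    previous = [0.0, 10.0]
    current = [0.0, 10.0]
    for _ in range(frames):
        step(previous, current)
        # Drawing would read from current here
        previous, current = current, previous  # swap roles for the next frame
    return previous, current


print(run_frames(3))
```

Only the references are swapped; no data is copied between the two buffers.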

shaders/compute_shader.glsl#

#version 430

// Set up our compute groups.
// The COMPUTE_SIZE_X and COMPUTE_SIZE_Y values will be replaced
// by the Python code with actual values. This does not happen
// automatically; the Python code must do the substitution itself.
layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;

// Input uniforms would go here if you need them.
// The examples below match the ones commented out in main.py
//uniform vec2 screen_size;
//uniform float frame_time;

// Structure of the star data
struct Star
{
    vec4 pos;
    vec4 vel;
    vec4 color;
};

// Input buffer
layout(std430, binding=0) buffer stars_in
{
    Star stars[];
} In;

// Output buffer
layout(std430, binding=1) buffer stars_out
{
    Star stars[];
} Out;

void main()
{
    // Only the x component is used because stars are indexed in one dimension
    int curStarIndex = int(gl_GlobalInvocationID.x);

    Star in_star = In.stars[curStarIndex];

    vec4 p = in_star.pos.xyzw;
    vec4 v = in_star.vel.xyzw;

    // Move the star according to the current force
    p.xy += v.xy;

    // Calculate the new force based on all the other bodies
    for (int i=0; i < In.stars.length(); i++) {
        // If enabled, this will keep the star from calculating gravity on itself.
        // However, it does slow down the calculations to do this check.
        //  if (i == curStarIndex)
        //      continue;

        // Calculate distance squared
        float dist = distance(In.stars[i].pos.xyzw.xy, p.xy);
        float distanceSquared = dist * dist;

        // If distance is too small, extremely high forces can result and
        // fling the star into escape velocity and forever off the screen.
        // Capping the force term prevents this. Note that minDistance,
        // despite its name, is applied as an upper limit on the force term.
        float minDistance = 0.02;
        float gravityStrength = 0.3;
        float simulationSpeed = 0.002;
        float force = min(minDistance, gravityStrength / distanceSquared) * -simulationSpeed;

        vec2 diff = p.xy - In.stars[i].pos.xyzw.xy;
        // We should normalize this I think, but it doesn't work.
        //  diff = normalize(diff);
        vec2 delta_v = diff * force;
        v.xy += delta_v;
    }


    Star out_star;
    out_star.pos.xyzw = p.xyzw;
    out_star.vel.xyzw = v.xyzw;

    vec4 c = in_star.color.xyzw;
    out_star.color.xyzw = c.xyzw;

    Out.stars[curStarIndex] = out_star;
}
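
A CPU reference can help when checking the shader's math on a handful of stars. The sketch below follows the same steps as the shader for a single star. The function name and constant names are illustrative, and skipping zero-distance pairs is an assumption that avoids a division by zero in Python; such pairs contribute nothing anyway because their position difference is zero.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]

MAX_FORCE_TERM = 0.02     # named minDistance in the shader
GRAVITY_STRENGTH = 0.3
SIMULATION_SPEED = 0.002


def update_star(index: int, positions: List[Vec2], velocities: List[Vec2]) -> Tuple[Vec2, Vec2]:
    """Return the new (position, velocity) of one star."""
    px, py = positions[index]
    vx, vy = velocities[index]
    # Move the star using the velocity from the previous frame
    px += vx
    py += vy
    for ox, oy in positions:
        dx, dy = px - ox, py - oy
        dist_sq = dx * dx + dy * dy
        if dist_sq == 0.0:
            continue  # assumption: zero difference adds no velocity
        force = min(MAX_FORCE_TERM, GRAVITY_STRENGTH / dist_sq) * -SIMULATION_SPEED
        vx += dx * force
        vy += dy * force
    return (px, py), (vx, vy)


print(update_star(0, [(0.0, 0.0), (10.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

In the two-star example, the first star gains a small positive x velocity, which means it starts moving toward the second star.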

The Finished Python Program#

The code includes thorough docstrings and comments that explain how it works.

main.py#

"""
N-Body Gravity with Compute Shaders & Buffers
"""
import random
from array import array
from pathlib import Path
from typing import Generator, Tuple

import arcade
from arcade.gl import BufferDescription

# Window dimensions in pixels
WINDOW_WIDTH = 800
WINDOW_HEIGHT = 600

# Size of performance graphs in pixels
GRAPH_WIDTH = 200
GRAPH_HEIGHT = 120
GRAPH_MARGIN = 5

NUM_STARS: int = 4000
USE_COLORED_STARS: bool = True


def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    :param screen_size: A (width, height) of the area to generate stars in
    :param num_stars: How many stars to generate
    :param use_color: Whether to generate white or randomized pastel stars
    :return: an array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())


class NBodyGravityWindow(arcade.Window):

    def __init__(self):
        # Ask for OpenGL context supporting version 4.3 or greater when
        # calling the parent initializer to make sure we have compute shader
        # support.
        super().__init__(
            WINDOW_WIDTH, WINDOW_HEIGHT,
            "N-Body Gravity with Compute Shaders & Buffers",
            gl_version=(4, 3),
            resizable=False
        )
        # Attempt to put the window in the center of the screen.
        self.center_window()

        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )

        # --- Create the visualization shaders

        vertex_shader_source = Path("shaders/vertex_shader.glsl").read_text()
        fragment_shader_source = Path("shaders/fragment_shader.glsl").read_text()
        geometry_shader_source = Path("shaders/geometry_shader.glsl").read_text()

        # Create the complete shader program which will draw the stars
        self.program = self.ctx.program(
            vertex_shader=vertex_shader_source,
            geometry_shader=geometry_shader_source,
            fragment_shader=fragment_shader_source,
        )

        # --- Create our compute shader

        # Load in the raw source code safely & auto-close the file
        compute_shader_source = Path("shaders/compute_shader.glsl").read_text()

        # Compute shaders use groups to parallelize execution.
        # You don't need to understand how this works yet, but the
        # values below should serve as reasonable defaults. Later, we'll
        # preprocess the shader source by replacing the templating token
        # with its corresponding value.
        self.group_x = 256
        self.group_y = 1

        self.compute_shader_defines = {
            "COMPUTE_SIZE_X": self.group_x,
            "COMPUTE_SIZE_Y": self.group_y
        }

        # Preprocess the source by replacing each define with its value as a string
        for templating_token, value in self.compute_shader_defines.items():
            compute_shader_source = compute_shader_source.replace(templating_token, str(value))

        self.compute_shader = self.ctx.compute_shader(source=compute_shader_source)

        # --- Create the FPS graph

        # Enable timings for the performance graph
        arcade.enable_timings()

        # Create a sprite list to put the performance graph into
        self.perf_graph_list = arcade.SpriteList()

        # Create the FPS performance graph
        graph = arcade.PerfGraph(GRAPH_WIDTH, GRAPH_HEIGHT, graph_data="FPS")
        graph.position = GRAPH_WIDTH / 2, self.height - GRAPH_HEIGHT / 2
        self.perf_graph_list.append(graph)

    def on_draw(self):
        # Clear the screen
        self.clear()
        # Enable blending so our alpha channel works
        self.ctx.enable(self.ctx.BLEND)

        # Bind buffers
        self.ssbo_previous.bind_to_storage_buffer(binding=0)
        self.ssbo_current.bind_to_storage_buffer(binding=1)

        # If you wanted, you could set input variables for compute shader
        # as in the lines commented out below. You would have to add or
        # uncomment corresponding lines in compute_shader.glsl
        # self.compute_shader["screen_size"] = self.get_size()
        # self.compute_shader["frame_time"] = self.frame_time

        # Run compute shader to calculate new positions for this frame
        self.compute_shader.run(group_x=self.group_x, group_y=self.group_y)

        # Draw the current star positions
        self.vao_current.render(self.program)

        # Swap the buffer pairs.
        # The buffers for the current state become the initial state,
        # and the data of this frame's initial state will be overwritten.
        self.ssbo_previous, self.ssbo_current = self.ssbo_current, self.ssbo_previous
        self.vao_previous, self.vao_current = self.vao_current, self.vao_previous

        # Draw the graphs
        self.perf_graph_list.draw()



if __name__ == "__main__":
    app = NBodyGravityWindow()
    arcade.run()
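
Two details of the listing above can be tried in isolation: the token substitution applied to the shader source, and the relationship between group counts and invocations. With the values in main.py, 256 groups of 256 invocations give 65,536 invocations along x. The sketch below uses hypothetical helper names and shows how a smaller group count could be derived. It is not a drop-in change: 16 groups of 256 still launch 4,096 invocations for 4,000 stars, and the shader as written has no bounds check for the extra ones.

```python
from typing import Dict


def apply_defines(source: str, defines: Dict[str, int]) -> str:
    """Replace each templating token in the shader source with its value."""
    for token, value in defines.items():
        source = source.replace(token, str(value))
    return source


def groups_needed(num_items: int, local_size: int) -> int:
    """Smallest group count whose invocations cover num_items (ceiling division)."""
    return -(-num_items // local_size)


layout_line = "layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;"
print(apply_defines(layout_line, {"COMPUTE_SIZE_X": 256, "COMPUTE_SIZE_Y": 1}))
print(groups_needed(4000, 256))
```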

An expanded version of this tutorial with support for 3D is available at: https://github.com/pvcraven/n-body