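Compute Shader#

For certain types of calculations, a compute shader running on the GPU can be thousands of times faster than the same work on the CPU alone.

In this tutorial, we will simulate a star field using an "N-Body simulation." Each star is affected by the gravity of every other star. For 1,000 stars, this means 1,000 x 1,000 = 1,000,000 calculations to perform for each frame. The video has 65,000 stars, requiring about 4.2 billion gravity calculations per frame. On high-end hardware, it can still run at 60 fps!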
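These figures come from pairing every star with every star, so the work grows with the square of the star count. A quick check in Python (the helper name `interactions_per_frame` is hypothetical and used only for illustration):

```python
def interactions_per_frame(num_stars: int) -> int:
    """Number of star-on-star gravity evaluations in one frame.

    Every star is compared against every star (itself included),
    so the count is simply num_stars squared.
    """
    return num_stars * num_stars


for n in (1_000, 65_000):
    print(f"{n:>6,} stars -> {interactions_per_frame(n):,} calculations per frame")
```

Doubling the number of stars therefore roughly quadruples the work per frame, which is why this kind of workload benefits so much from running in parallel.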

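How does this work? The program has three main parts:

  • The Python code, which allocates buffers and glues everything together

  • The visualization shaders, which let us see the data in the buffers

  • The compute shader, which moves everything

Buffers#

We need a place to store the data we'll visualize. To do so, we'll create two Shader Storage Buffer Objects (SSBOs) of floating-point numbers from within our Python code. One will hold the previous frame's star positions, and the other will store the positions calculated for the next frame.

Each buffer must be able to store the following for each star:

  1. The x, y, and radius of the star

  2. The velocity of the star, which is not used by the visualization

  3. The floating-point RGBA color of the star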
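As a rough sketch of that layout, assuming each of the three items is padded out to four floats (as the generator function in this tutorial does), a star occupies 12 consecutive floats in a flat array. The constant `FLOATS_PER_STAR` and the helper `read_star` are hypothetical names used only for illustration:

```python
from array import array

FLOATS_PER_STAR = 12  # 4 position/radius + 4 velocity + 4 color


def read_star(data: array, index: int) -> dict:
    """Slice one star's record out of a flat float array."""
    start = index * FLOATS_PER_STAR
    record = data[start:start + FLOATS_PER_STAR]
    return {
        "position": (record[0], record[1]),  # x, y (record[2] is z padding)
        "radius": record[3],
        "velocity": (record[4], record[5]),  # record[6] and record[7] are padding
        "color": tuple(record[8:12]),        # r, g, b, a
    }


# Two hand-written stars, laid out back to back
data = array("f", [
    100.0, 200.0, 0.0, 6.0,   0.0, 0.0, 0.0, 0.0,   1.0, 1.0, 1.0, 1.0,
    300.0, 400.0, 0.0, 6.0,   0.0, 0.0, 0.0, 0.0,   0.5, 0.75, 1.0, 1.0,
])
second = read_star(data, 1)
print(second["position"], second["radius"], second["color"])
print("bytes per star:", FLOATS_PER_STAR * data.itemsize)
```

With 4-byte floats, each star takes 48 bytes, so star number i starts at byte offset 48 * i.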

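Generating Aligned Data#

To avoid issues with GPU memory alignment, we'll use the function below to generate well-aligned data ready to load into the SSBOs. The docstrings and comments explain why in greater detail:

Generating well-aligned data to load onto the GPU#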
def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    :param screen_size: A (width, height) of the area to generate stars in
    :param num_stars: How many stars to generate
    :param use_color: Whether to generate white or randomized pastel stars
    :return: an array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())

Allocating the Buffers#

Allocating the buffers & loading the data onto the GPU#
        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
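To see what the format string skips, here is a CPU-side analogy built on Python's standard `struct` module rather than on arcade itself. It assumes 4-byte floats and uses `16x` (16 pad bytes) as a stand-in for the `4x4` entry; the `struct` format below is not the syntax arcade uses, and the name `STAR_VIEW` is hypothetical:

```python
import struct
from array import array

# 4 floats (x, y, z, radius), skip 16 bytes of velocity, 4 floats (r, g, b, a)
STAR_VIEW = struct.Struct("=4f16x4f")

# One star: position/radius, velocity, color
raw = array("f", [
    10.0, 20.0, 0.0, 6.0,
    1.5, -2.5, 0.0, 0.0,
    1.0, 0.5, 0.25, 1.0,
]).tobytes()

x, y, z, radius, r, g, b, a = STAR_VIEW.unpack_from(raw, 0)
print("stride in bytes:", STAR_VIEW.size)
print("in_vertex:", (x, y, z, radius))
print("in_color: ", (r, g, b, a))
```

The velocity values never reach the unpacked tuple, which mirrors how the visualization shaders only ever see `in_vertex` and `in_color`.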

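Visualization Shaders#

Now that we have the data, we need to be able to visualize it. We'll do so by applying vertex, geometry, and fragment shaders to convert the data in the SSBO into pixels. For each star's 12 floats in the array, the following flow of data takes place:

[Figure: data flow through the shaders (shaders.svg)]

Vertex Shader#

In this tutorial, the vertex shader runs once for each star's 12-float stretch of raw padded data in self.ssbo_current. Each execution outputs clean, typed data to an instance of the geometry shader.

Data is read in as follows:

  • The x, y, and radius of each star, via in_vertex

  • The floating-point RGBA color of the star, via in_color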

shaders/vertex_shader.glsl#
#version 330

in vec4 in_vertex;
in vec4 in_color;

out vec2 vertex_pos;
out float vertex_radius;
out vec4 vertex_color;

void main()
{
    vertex_pos = in_vertex.xy;
    vertex_radius = in_vertex.w;
    vertex_color = in_color;
}

The variables below are then passed as inputs to the geometry shader:

  • vertex_pos

  • vertex_radius

  • vertex_color

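Geometry Shader#

The geometry shader converts each single point into a quad, in this case a square, which the GPU can render. It does this by emitting four points centered on the input point.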

shaders/geometry_shader.glsl#
#version 330

layout (points) in;
layout (triangle_strip, max_vertices = 4) out;

// Use arcade's global projection UBO
uniform Projection {
    uniform mat4 matrix;
} proj;


// The outputs from the vertex shader are used as inputs
in vec2 vertex_pos[];
in float vertex_radius[];
in vec4 vertex_color[];

// These are used with EmitVertex to generate four points of
// a quad centered around vertex_pos[0].
out vec2 g_uv;
out vec3 g_color;

void main() {
    vec2 center = vertex_pos[0];
    vec2 hsize = vec2(vertex_radius[0]);

    g_color = vertex_color[0].rgb;

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(-hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(0, 0);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 1);
    EmitVertex();

    gl_Position = proj.matrix * vec4(vec2(hsize.x, -hsize.y) + center, 0.0, 1.0);
    g_uv = vec2(1, 0);
    EmitVertex();

    // End geometry emission
    EndPrimitive();
}
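The same corner arithmetic, minus the projection matrix, can be sketched on the CPU. The function name `quad_corners` is hypothetical; the corner order mirrors the four EmitVertex() calls in the shader, which together form a triangle strip:

```python
from typing import List, Tuple

Corner = Tuple[Tuple[float, float], Tuple[int, int]]


def quad_corners(center: Tuple[float, float], radius: float) -> List[Corner]:
    """Return (position, uv) pairs for a square centered on `center`.

    Offsets follow the shader's order:
    (-r, +r), (-r, -r), (+r, +r), (+r, -r).
    """
    cx, cy = center
    return [
        ((cx - radius, cy + radius), (0, 1)),
        ((cx - radius, cy - radius), (0, 0)),
        ((cx + radius, cy + radius), (1, 1)),
        ((cx + radius, cy - radius), (1, 0)),
    ]


for position, uv in quad_corners((100.0, 50.0), 6.0):
    print(position, uv)
```

Each side of the resulting square is twice the radius, and the UV coordinates run from (0, 0) to (1, 1) across it, which is what the fragment shader relies on.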

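Fragment Shader#

A fragment shader runs for each pixel in a quad. It converts a UV coordinate within the quad to a floating-point RGBA value. In this tutorial, the shader produces the soft glowing circle on the surface of each star's quad.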

shaders/fragment_shader.glsl#
#version 330

in vec2 g_uv;
in vec3 g_color;

out vec4 out_color;

void main()
{
    float l = length(vec2(0.5, 0.5) - g_uv.xy);
    if (l > 0.5)
    {
        discard;
    }
    float alpha;
    if (l == 0.0)
        alpha = 1.0;
    else
        alpha = min(1.0, 0.60 - l * 2.0);

    vec3 c = g_color.rgb;
    // c.xy += v_uv.xy * 0.05;
    // c.xy += v_pos.xy * 0.75;
    out_color = vec4(c, alpha);
}
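To get a feel for the falloff, the alpha calculation can be mirrored in plain Python. This is only a CPU-side sketch: the function name `star_alpha` is hypothetical, and Python uses double precision where the shader uses single precision:

```python
import math
from typing import Optional, Tuple


def star_alpha(uv: Tuple[float, float]) -> Optional[float]:
    """Mirror the fragment shader's alpha calculation for one UV coordinate.

    Returns None where the shader would `discard` the fragment.
    """
    distance = math.hypot(0.5 - uv[0], 0.5 - uv[1])
    if distance > 0.5:
        return None  # outside the circle inscribed in the quad
    if distance == 0.0:
        return 1.0
    return min(1.0, 0.60 - distance * 2)


for uv in [(0.5, 0.5), (0.5, 0.625), (0.0, 0.0)]:
    print(uv, star_alpha(uv))
```

The formula reaches zero at a distance of 0.3 from the center of the quad, so the visible glow is noticeably smaller than the quad itself.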

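Compute Shader#

Now that we have a way to display the data, we should update it.

We created pairs of buffers earlier. We will use one SSBO as the input buffer holding the previous frame's data, and the other as the output buffer to write results to.

We then swap the buffers each frame after drawing, using the output as the input of the next frame, and repeat the process until the program stops running.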
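This read-one, write-the-other pattern is often called ping-pong buffering. A minimal CPU-side sketch, using plain Python lists in place of SSBOs and a toy update rule in place of the compute shader (`run_frames` and `add_one` are hypothetical names):

```python
from typing import Callable, List


def run_frames(
    initial: List[float],
    step: Callable[[List[float], List[float]], None],
    frames: int,
) -> List[float]:
    """Run `step` repeatedly while swapping two buffers."""
    previous = list(initial)
    current = list(initial)
    for _ in range(frames):
        step(previous, current)  # read from previous, write into current
        # (drawing `current` would happen here)
        previous, current = current, previous  # swap roles for the next frame
    # After the final swap, `previous` holds the newest results
    return previous


def add_one(src: List[float], dst: List[float]) -> None:
    """Toy update rule standing in for the compute shader."""
    for i, value in enumerate(src):
        dst[i] = value + 1.0


print(run_frames([0.0, 10.0], add_one, frames=3))
```

Because the update never writes into the buffer it is reading from, every element sees a consistent snapshot of the previous frame, which is what makes it safe to process all stars in parallel.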

shaders/compute_shader.glsl#
#version 430

// Set up our compute groups.
// The COMPUTE_SIZE_X and COMPUTE_SIZE_Y values will be replaced
// by the Python code with actual values. This does not happen
// automatically, and must be done manually.
layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;

// Input uniforms would go here if you need them.
// The examples below match the ones commented out in main.py
//uniform vec2 screen_size;
//uniform float frame_time;

// Structure of the star data
struct Star
{
    vec4 pos;
    vec4 vel;
    vec4 color;
};

// Input buffer
layout(std430, binding=0) buffer stars_in
{
    Star stars[];
} In;

// Output buffer
layout(std430, binding=1) buffer stars_out
{
    Star stars[];
} Out;

void main()
{
    // Each invocation handles one star, indexed along x
    int curStarIndex = int(gl_GlobalInvocationID.x);

    // More invocations may be dispatched than there are stars,
    // so skip any index past the end of the buffer.
    if (curStarIndex >= In.stars.length())
        return;

    Star in_star = In.stars[curStarIndex];

    vec4 p = in_star.pos.xyzw;
    vec4 v = in_star.vel.xyzw;

    // Move the star according to its current velocity
    p.xy += v.xy;

    // Calculate the new force based on all the other bodies
    for (int i=0; i < In.stars.length(); i++) {
        // If enabled, this will keep the star from calculating gravity on itself.
        // However, it does slow down the calculations to do this check.
        //  if (i == curStarIndex)
        //      continue;

        // Calculate distance squared
        float dist = distance(In.stars[i].pos.xy, p.xy);
        float distanceSquared = dist * dist;

        // If distance is too small, extremely high forces can result and
        // fling the star into escape velocity and forever off the screen.
        // Capping the force term prevents this. Despite its name,
        // minDistance is used below as that cap.
        float minDistance = 0.02;
        float gravityStrength = 0.3;
        float simulationSpeed = 0.002;
        float force = min(minDistance, gravityStrength / distanceSquared) * -simulationSpeed;

        vec2 diff = p.xy - In.stars[i].pos.xy;
        // We should normalize this I think, but it doesn't work.
        //  diff = normalize(diff);
        vec2 delta_v = diff * force;
        v.xy += delta_v;
    }


    Star out_star;
    out_star.pos.xyzw = p.xyzw;
    out_star.vel.xyzw = v.xyzw;

    vec4 c = in_star.color.xyzw;
    out_star.color.xyzw = c.xyzw;

    Out.stars[curStarIndex] = out_star;
}
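For checking the math on a handful of stars, the shader's update can be mirrored in plain Python. This is only a reference sketch: it is far too slow for real use, it tracks just x, y, and velocity, and it skips zero-distance pairs explicitly because Python raises an error on division by zero. The names `step`, `Star`, and `MAX_FORCE_TERM` are hypothetical:

```python
from typing import List, Tuple

# Constants copied from the shader
MAX_FORCE_TERM = 0.02     # named minDistance in the shader
GRAVITY_STRENGTH = 0.3
SIMULATION_SPEED = 0.002

Star = Tuple[float, float, float, float]  # x, y, vx, vy


def step(stars: List[Star]) -> List[Star]:
    """Compute the next frame from the previous one, one star at a time."""
    result = []
    for x, y, vx, vy in stars:
        # Move the star using its current velocity
        x, y = x + vx, y + vy
        # Accumulate the pull of every star in the input data
        for ox, oy, _, _ in stars:
            dx, dy = x - ox, y - oy
            dist_sq = dx * dx + dy * dy
            if dist_sq == 0.0:
                continue  # a zero offset contributes nothing to the velocity
            force = min(MAX_FORCE_TERM, GRAVITY_STRENGTH / dist_sq) * -SIMULATION_SPEED
            vx += dx * force
            vy += dy * force
        result.append((x, y, vx, vy))
    return result


stars = step([(0.0, 0.0, 0.0, 0.0), (10.0, 0.0, 0.0, 0.0)])
print(stars)
```

With two stars at rest, the first step leaves both positions unchanged and gives them equal and opposite velocities pointing toward each other; the positions only start to move on the following step.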

The Finished Python Program#

The code includes thorough docstrings and comments that explain how it works.

main.py#
"""
N-Body Gravity with Compute Shaders & Buffers
"""
import random
from array import array
from pathlib import Path
from typing import Generator, Tuple

import arcade
from arcade.gl import BufferDescription

# Window dimensions in pixels
WINDOW_WIDTH = 800
WINDOW_HEIGHT = 600

# Size of performance graphs in pixels
GRAPH_WIDTH = 200
GRAPH_HEIGHT = 120
GRAPH_MARGIN = 5

NUM_STARS: int = 4000
USE_COLORED_STARS: bool = True


def gen_initial_data(
        screen_size: Tuple[int, int],
        num_stars: int = NUM_STARS,
        use_color: bool = False
) -> array:
    """
    Generate an :py:class:`~array.array` of randomly positioned star data.

    Some of this data is wasted as padding because:

    1. GPUs expect SSBO data to be aligned to multiples of 4
    2. GLSL's vec3 is actually a vec4 with compiler-side restrictions,
       so we have to use 4-length vectors anyway.

    :param screen_size: A (width, height) of the area to generate stars in
    :param num_stars: How many stars to generate
    :param use_color: Whether to generate white or randomized pastel stars
    :return: an array of star position data
    """
    width, height = screen_size
    color_channel_min = 0.5 if use_color else 1.0

    def _data_generator() -> Generator[float, None, None]:
        """Inner generator function used to illustrate memory layout"""

        for i in range(num_stars):
            # Position/radius
            yield random.randrange(0, width)
            yield random.randrange(0, height)
            yield 0.0  # z (padding, unused by shaders)
            yield 6.0

            # Velocity (unused by visualization shaders)
            yield 0.0
            yield 0.0
            yield 0.0  # vz (padding, unused by shaders)
            yield 0.0  # vw (padding, unused by shaders)

            # Color
            yield random.uniform(color_channel_min, 1.0)  # r
            yield random.uniform(color_channel_min, 1.0)  # g
            yield random.uniform(color_channel_min, 1.0)  # b
            yield 1.0  # a

    # Use the generator function to fill an array in RAM
    return array('f', _data_generator())


class NBodyGravityWindow(arcade.Window):

    def __init__(self):
        # Ask for OpenGL context supporting version 4.3 or greater when
        # calling the parent initializer to make sure we have compute shader
        # support.
        super().__init__(
            WINDOW_WIDTH, WINDOW_HEIGHT,
            "N-Body Gravity with Compute Shaders & Buffers",
            gl_version=(4, 3),
            resizable=False
        )
        # Attempt to put the window in the center of the screen.
        self.center_window()

        # --- Create buffers

        # Create pairs of buffers for the compute & visualization shaders.
        # We will swap which buffer instance is the initial value and
        # which is used as the current value to write to.

        # ssbo = shader storage buffer object
        initial_data = gen_initial_data(self.get_size(), use_color=USE_COLORED_STARS)
        self.ssbo_previous = self.ctx.buffer(data=initial_data)
        self.ssbo_current = self.ctx.buffer(data=initial_data)

        # vao = vertex array object
        # Format string describing how to interpret the SSBO buffer data.
        # 4f = position and size -> x, y, z, radius
        # 4x4 = Four floats used for calculating velocity. Not needed for visualization.
        # 4f = color -> rgba
        buffer_format = "4f 4x4 4f"

        # Attribute variable names for the vertex shader
        attributes = ["in_vertex", "in_color"]

        self.vao_previous = self.ctx.geometry(
            [BufferDescription(self.ssbo_previous, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )
        self.vao_current = self.ctx.geometry(
            [BufferDescription(self.ssbo_current, buffer_format, attributes)],
            mode=self.ctx.POINTS,
        )

        # --- Create the visualization shaders

        vertex_shader_source = Path("shaders/vertex_shader.glsl").read_text()
        fragment_shader_source = Path("shaders/fragment_shader.glsl").read_text()
        geometry_shader_source = Path("shaders/geometry_shader.glsl").read_text()

        # Create the complete shader program which will draw the stars
        self.program = self.ctx.program(
            vertex_shader=vertex_shader_source,
            geometry_shader=geometry_shader_source,
            fragment_shader=fragment_shader_source,
        )

        # --- Create our compute shader

        # Load in the raw source code safely & auto-close the file
        compute_shader_source = Path("shaders/compute_shader.glsl").read_text()

        # Compute shaders use groups to parallelize execution.
        # You don't need to understand how this works yet, but the
        # values below should serve as reasonable defaults. Later, we'll
        # preprocess the shader source by replacing the templating token
        # with its corresponding value.
        self.group_x = 256
        self.group_y = 1

        self.compute_shader_defines = {
            "COMPUTE_SIZE_X": self.group_x,
            "COMPUTE_SIZE_Y": self.group_y
        }

        # Preprocess the source by replacing each define with its value as a string
        for templating_token, value in self.compute_shader_defines.items():
            compute_shader_source = compute_shader_source.replace(templating_token, str(value))

        self.compute_shader = self.ctx.compute_shader(source=compute_shader_source)

        # --- Create the FPS graph

        # Enable timings for the performance graph
        arcade.enable_timings()

        # Create a sprite list to put the performance graph into
        self.perf_graph_list = arcade.SpriteList()

        # Create the FPS performance graph
        graph = arcade.PerfGraph(GRAPH_WIDTH, GRAPH_HEIGHT, graph_data="FPS")
        graph.position = GRAPH_WIDTH / 2, self.height - GRAPH_HEIGHT / 2
        self.perf_graph_list.append(graph)

    def on_draw(self):
        # Clear the screen
        self.clear()
        # Enable blending so our alpha channel works
        self.ctx.enable(self.ctx.BLEND)

        # Bind buffers
        self.ssbo_previous.bind_to_storage_buffer(binding=0)
        self.ssbo_current.bind_to_storage_buffer(binding=1)

        # If you wanted, you could set input variables for compute shader
        # as in the lines commented out below. You would have to add or
        # uncomment corresponding lines in compute_shader.glsl
        # self.compute_shader["screen_size"] = self.get_size()
        # self.compute_shader["frame_time"] = self.frame_time

        # Run compute shader to calculate new positions for this frame
        self.compute_shader.run(group_x=self.group_x, group_y=self.group_y)

        # Draw the current star positions
        self.vao_current.render(self.program)

        # Swap the buffer pairs.
        # The buffers for the current state become the initial state,
        # and the data of this frame's initial state will be overwritten.
        self.ssbo_previous, self.ssbo_current = self.ssbo_current, self.ssbo_previous
        self.vao_previous, self.vao_current = self.vao_current, self.vao_previous

        # Draw the graphs
        self.perf_graph_list.draw()


if __name__ == "__main__":
    app = NBodyGravityWindow()
    arcade.run()
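The templating step in main.py is plain string replacement, and it can be tried in isolation without a GPU. The sketch below uses a one-line stand-in for the shader source; `apply_defines` and `SHADER_SNIPPET` are hypothetical names. The invocation count at the end assumes, as in OpenGL's dispatch model, that the values passed to run() are work group counts:

```python
from typing import Dict

SHADER_SNIPPET = "layout(local_size_x=COMPUTE_SIZE_X, local_size_y=COMPUTE_SIZE_Y) in;"


def apply_defines(source: str, defines: Dict[str, int]) -> str:
    """Replace each templating token in the source with its value."""
    for token, value in defines.items():
        source = source.replace(token, str(value))
    return source


defines = {"COMPUTE_SIZE_X": 256, "COMPUTE_SIZE_Y": 1}
print(apply_defines(SHADER_SNIPPET, defines))

# Total invocations = work groups dispatched x invocations per group
groups_x, groups_y = 256, 1
local_x, local_y = defines["COMPUTE_SIZE_X"], defines["COMPUTE_SIZE_Y"]
print("invocations per dispatch:", groups_x * groups_y * local_x * local_y)
```

Under that assumption, the defaults in main.py launch 65,536 invocations per frame, which is more than the 4,000 stars in NUM_STARS and roughly matches the 65,000 stars mentioned at the start of the tutorial.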

An expanded version of this tutorial with support for 3D is available at: https://github.com/pvcraven/n-body