Speeding up vertex-array fillrates: Sequential Loop vs. vDSP

Recently I saw the WWDC presentation on the Accelerate framework that Apple provides for the iPhone.

So I thought to myself: “A very big part of my 3D engine’s code is dedicated to putting 3D geometry information into interleaved arrays. Maybe I could speed that up?”

I outlined my reasons for building my 3D engine the way I do in this post. Summing up, to increase fps I issue only one glDrawElements call for one big interleaved array. I don’t rebuild the array for each frame, though; I only change it when the geometry changes. This way I’m able to draw a scene with 32700 vertices at 30 fps with reasonable battery consumption, because all the regular computation is done by the GPU.

But to load the array or change parts of it, a lot of calculations have to be done on the CPU, namely rotating and translating the basic geometry to its position and angle in the scene.

So first, this is what my interleaved array struct looks like (again, I basically learned this from Jeff LaMarche’s blog):

typedef struct {
   float x;
   float y;
   float z;
} Vertex3;
typedef struct _iVertex3D {
   unsigned int color;
   Vertex3 v;
   Vertex3 n;
   float uv[2];
} iVertex3D;
static iVertex3D _interleavedVerts[MAX_VERTS];

Now I’m going to compare two versions of code that add the complete geometry of a scrub to this array. Overall this geometry has 486 vertices. This is what it looks like on the game board when drawn in the described way:

Here (xc,yc) is the place I want my geometry to be moved to (since I’m on a game board I don’t have to move in the z-direction) and “angle” is the angle by which the geometry needs to be rotated. Written as a matrix operation this looks like the following:
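In LaTeX notation, and consistent with the loop code in this post, the operation can be written as follows. The symbol α stands for “angle”, and the primed names for the transformed coordinates are my own labels:

```latex
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix}
=
\begin{pmatrix}
\cos\alpha & -\sin\alpha & 0 \\
\sin\alpha & \cos\alpha  & 0 \\
0          & 0           & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
+
\begin{pmatrix} x_c \\ y_c \\ 0 \end{pmatrix}
```

The normals use the same rotation matrix but no translation term.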

Using a for-loop:

unsigned boughcolor = (255 << 24) | (0 << 16) | (41 << 8 ) | (43 << 0);
float co = cosf(angle);
float si = sinf(angle);
for (int k=0; k<scrubVertexCount; k++) {
	// next free slot in the interleaved array
	iVertex3D *vert = &_interleavedVerts[_vertexCount];
	vert->v.x = xc+(co*scrubData[k].v.x - si*scrubData[k].v.y);
	vert->v.y = yc+(co*scrubData[k].v.y + si*scrubData[k].v.x);
	vert->v.z = scrubData[k].v.z;
	vert->uv[0] = 0.99f;
	vert->uv[1] = 0.99f;
	vert->n.x = co*scrubData[k].n.x - si*scrubData[k].n.y;
	vert->n.y = co*scrubData[k].n.y + si*scrubData[k].n.x;
	vert->n.z = scrubData[k].n.z;
	vert->color = boughcolor;
	_vertexCount++;
}

Using a memcpy and vDSP:

void* dest = _interleavedVerts+_vertexCount;
memcpy(dest, basicScrub, scrubVertexCount*sizeof(iVertex3D));

float co = cosf(angle);
float si = sinf(angle);
float msi = -si;

int stride = 9;

//tempVecX = co* x + xc;
vDSP_vsmsa(scrubX, 1, &co, &xc, tempVecX, 1, scrubVertexCount);
//tempVecY = si* x + yc;
vDSP_vsmsa(scrubX, 1, &si, &yc, tempVecY, 1, scrubVertexCount);
// x = -si*y + tempVecX
vDSP_vsma(scrubY, 1, &msi, tempVecX, 1, 
   &_interleavedVerts[_vertexCount].v.x, stride, scrubVertexCount);
// y = co*y  + tempVecY
vDSP_vsma(scrubY, 1, &co, tempVecY, 1, 
   &_interleavedVerts[_vertexCount].v.y, stride, scrubVertexCount);

//rotate the normals (no translation needed):
//tempVecX = co * n.x
vDSP_vsmul(scrubNX, 1, &co, tempVecX, 1, scrubVertexCount);
//tempVecY = si * n.x
vDSP_vsmul(scrubNX, 1, &si, tempVecY, 1, scrubVertexCount);
// n.x = -si*n.y + tempVecX
vDSP_vsma(scrubNY, 1, &msi, tempVecX, 1, 
   &_interleavedVerts[_vertexCount].n.x, stride, scrubVertexCount);
// n.y = co*n.y  + tempVecY
vDSP_vsma(scrubNY, 1, &co, tempVecY, 1, 
   &_interleavedVerts[_vertexCount].n.y, stride, scrubVertexCount);

_vertexCount += scrubVertexCount;

Now what am I doing here?

The first one sequentially adds vertices to the interleaved array using the geometry data stored in “scrubData”.

The second one instead copies the whole memory area of “basicScrub” with “memcpy” and then uses Apple’s vDSP library to calculate the rotation and translation of the x and y coordinates. The point here is that vDSP uses the vector instructions of the processor and can therefore apply one operation to several values of a vector at once, in about the time a single scalar operation would take. (Small note: “scrubX” and “scrubY” are float[] arrays that are simply copied from “scrubData[k].v.x” and “…y” so that some of the vDSP calls can use a stride of 1, which improves performance a lot.)

See the Wikipedia entry on SIMD for an illustration of the idea.
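To make the two-pass formulation and the stride of 9 easier to follow, here is a minimal portable sketch in plain C. The functions ref_vsmsa and ref_vsma are my own scalar stand-ins that only mimic the arithmetic of vDSP_vsmsa and vDSP_vsma (they are not Apple’s implementation and are not vectorized), and split_xy, place_two_pass, place_direct and max_xy_difference are hypothetical helper names. The sketch assumes 4-byte float and unsigned int with no struct padding.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vertex3;
typedef struct {
    unsigned int color;
    Vertex3 v;
    Vertex3 n;
    float uv[2];
} iVertex3D;

/* Size of one interleaved vertex counted in floats: 36 bytes / 4 bytes = 9
   (assumes 4-byte float and unsigned int, no padding). */
enum { STRIDE = sizeof(iVertex3D) / sizeof(float) };

/* Scalar stand-in for the arithmetic of vDSP_vsmsa: d[i*id] = a[i*ia]*b + c */
void ref_vsmsa(const float *a, int ia, float b, float c,
               float *d, int id, int n)
{
    for (int i = 0; i < n; i++) d[i * id] = a[i * ia] * b + c;
}

/* Scalar stand-in for the arithmetic of vDSP_vsma: d[i*id] = a[i*ia]*b + c[i*ic] */
void ref_vsma(const float *a, int ia, float b, const float *c, int ic,
              float *d, int id, int n)
{
    for (int i = 0; i < n; i++) d[i * id] = a[i * ia] * b + c[i * ic];
}

/* Copy x and y out of the interleaved source into planar arrays (stride 1). */
void split_xy(const iVertex3D *src, float *xs, float *ys, int n)
{
    for (int i = 0; i < n; i++) { xs[i] = src[i].v.x; ys[i] = src[i].v.y; }
}

/* Two-pass placement with the same call pattern as the vDSP version:
   the results are written straight into dst with a stride of 9 floats. */
void place_two_pass(const float *xs, const float *ys, float *tmpX, float *tmpY,
                    iVertex3D *dst, float co, float si, float xc, float yc, int n)
{
    ref_vsmsa(xs, 1, co, xc, tmpX, 1, n);                   /* tmpX = co*x + xc   */
    ref_vsmsa(xs, 1, si, yc, tmpY, 1, n);                   /* tmpY = si*x + yc   */
    ref_vsma(ys, 1, -si, tmpX, 1, &dst[0].v.x, STRIDE, n);  /* x' = -si*y + tmpX  */
    ref_vsma(ys, 1, co, tmpY, 1, &dst[0].v.y, STRIDE, n);   /* y' =  co*y + tmpY  */
}

/* Direct per-vertex placement, as in the sequential loop. */
void place_direct(const iVertex3D *src, iVertex3D *dst,
                  float co, float si, float xc, float yc, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i].v.x = xc + (co * src[i].v.x - si * src[i].v.y);
        dst[i].v.y = yc + (co * src[i].v.y + si * src[i].v.x);
    }
}

/* Largest absolute difference between the x/y results of two arrays. */
float max_xy_difference(const iVertex3D *a, const iVertex3D *b, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float dx = a[i].v.x - b[i].v.x;
        float dy = a[i].v.y - b[i].v.y;
        if (dx < 0) dx = -dx;
        if (dy < 0) dy = -dy;
        if (dx > worst) worst = dx;
        if (dy > worst) worst = dy;
    }
    return worst;
}
```

Calling place_two_pass and place_direct on the same input and comparing the results with max_xy_difference should give a value close to zero. Small differences are expected because the two versions round in a different order.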

How the performance was measured:

To measure the performance of either method, I just put

NSDate* dateBefore = [NSDate date];
NSTimeInterval sec = [[NSDate date] timeIntervalSinceDate:dateBefore];

around the code to be measured. I tested on a second-generation iPod with the wireless turned off (MBX GPU and a 620 MHz CPU underclocked to 533 MHz).
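Filling 486 vertices takes very little time, so a single measurement can be noisy. One possible refinement, which is an assumption on my part and not how the numbers in this post were obtained, is to repeat the code many times and average. The sketch below uses the standard C clock() function, which reports CPU time and not wall-clock time. The names fill_fn, mean_seconds_per_call and example_fill are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload type: any function that fills the vertex array. */
typedef void (*fill_fn)(void *context);

/* Runs fn `repeats` times and returns the mean CPU time per call in seconds. */
double mean_seconds_per_call(fill_fn fn, void *context, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++) {
        fn(context);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Placeholder workload (hypothetical): touches 486 values so there is
   something to time. Replace it with the loop or the vDSP version. */
void example_fill(void *context)
{
    volatile float *acc = context;
    for (int k = 0; k < 486; k++) {
        *acc += (float)k;
    }
}
```

For example, mean_seconds_per_call(example_fill, &acc, 1000) with a float acc returns the average time of one call over 1000 runs.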


So it appears the vectorized way is more than 3 times faster!
(On a side note, I clearly had some fun using Blender to draw a bar chart here 😀 )
