Have you looked at vertex arrays?
The idea is that instead of making many glVertex/glColor/glTexCoord calls that send data to the video card, you instead put the data into buffers and send it in batches. If you can get the batches to be of a decent size it'll be significantly faster than immediate mode (which is what you're using now).
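To make that concrete, here's a rough sketch of what a batch might look like in C. The names (`Vertex`, `Batch`, `batch_add_quad`, `batch_draw`, `BATCH_MAX_VERTS`, `BATCH_USE_GL`) are all made up for illustration, and I'm assuming 2D textured quads with a byte color per vertex, which may not match your setup. The GL part is behind a macro so the buffer-filling code can be compiled and tried without a GL context.

```c
#include <assert.h>
#include <stddef.h>

#ifdef BATCH_USE_GL          /* hypothetical switch: define it when building with OpenGL */
#include <GL/gl.h>
#endif

#define BATCH_MAX_VERTS 4096 /* assumed capacity: 1024 quads */

/* One interleaved vertex: position, texture coordinate, color. */
typedef struct {
    float x, y;
    float u, v;
    unsigned char r, g, b, a;
} Vertex;

typedef struct {
    Vertex verts[BATCH_MAX_VERTS];
    int count;               /* number of vertices currently stored */
} Batch;

static void batch_reset(Batch *batch)
{
    assert(batch != NULL);
    batch->count = 0;
}

/* Appends the four corners of one quad. Returns 1 on success, or 0 if
   the batch is full (the caller would then draw, reset, and retry). */
static int batch_add_quad(Batch *batch,
                          float x, float y, float w, float h,
                          float u0, float v0, float u1, float v1,
                          unsigned char r, unsigned char g,
                          unsigned char b, unsigned char a)
{
    assert(batch != NULL);
    if (batch->count + 4 > BATCH_MAX_VERTS)
        return 0;

    Vertex *p = &batch->verts[batch->count];
    p[0] = (Vertex){ x,     y,     u0, v0, r, g, b, a };
    p[1] = (Vertex){ x + w, y,     u1, v0, r, g, b, a };
    p[2] = (Vertex){ x + w, y + h, u1, v1, r, g, b, a };
    p[3] = (Vertex){ x,     y + h, u0, v1, r, g, b, a };
    batch->count += 4;
    return 1;
}

#ifdef BATCH_USE_GL
/* Draws the whole batch with one call. Assumes the caller has already
   bound the texture that every quad in the batch uses. */
static void batch_draw(const Batch *batch)
{
    if (batch->count == 0)
        return;

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    /* The stride is sizeof(Vertex) because the attributes are interleaved. */
    glVertexPointer(2, GL_FLOAT, sizeof(Vertex), &batch->verts[0].x);
    glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &batch->verts[0].u);
    glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), &batch->verts[0].r);

    glDrawArrays(GL_QUADS, 0, batch->count);

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
#endif
```

Per frame you'd call `batch_reset`, add every quad that shares a texture, bind that texture, and call `batch_draw` once. Interleaving the attributes in one struct is just one possible layout; separate arrays per attribute work with the same GL calls.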
You wouldn't see much improvement from batching up the vertices of a single quad, for example, but go a bit beyond that and it'll make a big difference. The bigger the batch, the better. From what you're saying, it sounds like you could batch up everything that's using the same texture, and since you're using atlases, that should be a lot.
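One way to find those same-texture batches is to sort what you're about to draw by texture and then walk the sorted list. Below is a small sketch of that; `Sprite`, `compare_by_texture` and `sort_and_count_batches` are names I made up, and I'm assuming draw order doesn't matter (if you rely on back-to-front order for blending, you can only merge neighbours that already share a texture).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sprite record; only the texture matters for grouping. */
typedef struct {
    unsigned int texture;   /* GL texture name of the atlas the sprite uses */
    float x, y;
} Sprite;

/* qsort comparator: orders sprites by texture name. */
static int compare_by_texture(const void *a, const void *b)
{
    const Sprite *sa = a;
    const Sprite *sb = b;
    return (sa->texture > sb->texture) - (sa->texture < sb->texture);
}

/* Sorts the sprites by texture and returns how many batches (one
   texture bind plus one draw call each) the sorted list needs. */
static int sort_and_count_batches(Sprite *sprites, size_t n)
{
    assert(sprites != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(sprites, n, sizeof *sprites, compare_by_texture);

    int batches = 1;
    for (size_t i = 1; i < n; i++) {
        /* A change of texture ends the current batch. */
        if (sprites[i].texture != sprites[i - 1].texture)
            batches++;
    }
    return batches;
}
```

With a handful of atlases the batch count stays at a handful no matter how many sprites you draw, which is where the gain comes from.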
And yes, you'll be sending extra color data - for each vertex, instead of setting it once when the color changes - but it's worth it. You could also call glColor prior to drawing the vertex array and not enable the GL_COLOR_ARRAY client state, in which case that one color applies to every vertex in the batch. Which is better depends on how often the color changes - if you can get large batches with the same color value, the second approach could be better.
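If you want to pick between the two per batch, something along these lines could work. Treat it as a sketch: `Color`, `colors_are_uniform`, `set_up_batch_color` and the `BATCH_USE_GL` macro are hypothetical names, and I'm assuming a tightly packed array of byte colors, one per vertex. In real code you'd probably track "all the same color so far" while filling the batch rather than scanning it afterwards.

```c
#include <assert.h>
#include <stddef.h>

#ifdef BATCH_USE_GL          /* hypothetical switch: define it when building with OpenGL */
#include <GL/gl.h>
#endif

typedef struct {
    unsigned char r, g, b, a;
} Color;

/* Returns 1 if all n colors are identical (trivially true for n < 2),
   otherwise 0. */
static int colors_are_uniform(const Color *colors, size_t n)
{
    assert(colors != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (colors[i].r != colors[0].r || colors[i].g != colors[0].g ||
            colors[i].b != colors[0].b || colors[i].a != colors[0].a)
            return 0;
    }
    return 1;
}

#ifdef BATCH_USE_GL
/* Chooses between one glColor call and a per-vertex color array. */
static void set_up_batch_color(const Color *colors, size_t n)
{
    if (n > 0 && colors_are_uniform(colors, n)) {
        /* One color for the whole batch: no color array needed. */
        glDisableClientState(GL_COLOR_ARRAY);
        glColor4ub(colors[0].r, colors[0].g, colors[0].b, colors[0].a);
    } else {
        /* Colors vary: send one color per vertex (stride 0 = tightly packed). */
        glEnableClientState(GL_COLOR_ARRAY);
        glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors);
    }
}
#endif
```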