In the linked article, a diagram shows two copies of the packet being made, one of them entirely in kernel space and so presumably avoidable by kernel-space changes anyway. I can't see a couple of memcpy() calls being responsible for 90% of the CPU time (the article claims a 10-fold performance improvement), so I conclude that the user-space implementation is actually leaving out something pretty damn enormous.

I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?
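
For anyone checking my arithmetic, the 90% figure is just what a 10-fold speedup implies, assuming the claimed improvement is measured as per-packet CPU time: if the new path takes 1/10 of the old time, then 9/10 of the old time has to have gone away. A quick sketch (the function name is mine, not from the article):

```python
def fraction_removed(speedup: float) -> float:
    """Fraction of the original CPU time that must be eliminated
    to achieve the given speedup (assumes speedup = old_time / new_time)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

# A claimed 10-fold improvement means this share of the old cost is gone.
print(f"{fraction_removed(10):.0%}")
```

So the two copies would have to account for roughly nine-tenths of the original cost for them alone to explain the claim, which is what I find hard to believe.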

