However, does this approach work in general, or are the compiler overhead and/or the inability to actually fill the VLIW instruction words efficiently too costly?
"good VLIW code" is obviously subjective, and I admit I've never looked into rigorous comparisons between code generation for VLIW and non-VLIW architectures. There was quite a bit of research into VLIW compilation, though, even before Itanium. Monica Lam wrote a well-known piece on software pipelining for VLIW back in 1988, for example (it was included in SIGPLAN's Best of PLDI 1979-1999). Subsequent work by e.g. Gao improved Lam's algorithms. A '96 paper showed software pipelining in a state-of-the-art commercial compiler produced near-optimal scheduling, but that was for the R8000, not Itanium.
But in her retrospective for Best of PLDI Lam more or less agrees with your point regarding VLIW techniques and non-numeric code:
The Itanium, however, does not have a dynamic scheduler which is found in all other
modern processor architectures. Software pipelining is applicable only to codes with predictable behavior like numerical applications; as such, it only expands the number of instructions in innermost loops slightly. On the other hand, the behavior of non-numeric applications is much less predictable; without a dynamic scheduler, an aggressive static scheduler needs to generate codes for many alternate paths, which can lead to code bloat.
That was in 2003, and the Itanium architecture has since evolved, of course.
There seems to have been much less research on VLIW in the past decade than in the one before it, judging from the ACM Digital Library. And most of the recent work seems to deal with problems raised by VLIW (e.g. instruction merging when implementing SMT on VLIW cores) rather than with taking advantage of it.
In the early part of the present century, though, there was quite a bit of VLIW compiler research, so current VLIW compilers may be pretty good. I've on occasion looked at the code generated by the HP-UX 11.31i C compiler¹, but I wasn't trying to gauge its quality.
¹ Spent far too long debugging an intermittent issue that turned out to be caused by a trap representation in a register. It turns out Itanium supports a trap representation - a Not a Thing (NaT) - in its integer registers. There was a piece of code calling a function declared with void return type, but without a declaration in scope, so the compiler implicitly treated it as returning int. That meant the caller loaded the "return value" from a register when the call returned. Sometimes there was a valid value left in that register; once in a while it was a NaT, which caused the kernel to raise SIGILL. Elusive. There was a compiler diagnostic for the missing declaration, but it was an old code base full of warnings, with a build system that discarded the warnings if the build succeeded. Sigh.
The Itanium register trap representation is not a bad idea, but SIGILL is a lousy way to report it. It would have helped if, say, the signal(2) man page had mentioned this quirk of the CPU.