Your discussion of BLEU scores is incorrect. You wrote:
> But all BLEU really measures is word-by-word similarity: are the
> same words present in both documents, somewhere?
The documents being compared (the machine translation and one or more reference translations produced by humans) are first "segmented": broken up into units smaller than an entire document, typically sentences, and those units aligned. The alignment is not necessarily one-to-one, since one translation may use a single sentence where another breaks the thought into two or more sentences; and of course a bad translation may omit a sentence entirely.
> "Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."
One way to get around this problem is to use multiple reference translations, on the assumption that different human translators may choose different synonyms.
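A rough sketch of the idea (simplified to unigram matching, with made-up sentences): a word in the candidate counts as a match if it appears in *any* of the references, so a synonym chosen by one human translator is not penalized.

```python
def unigram_hits(candidate, references):
    # Simplified illustration: fraction of candidate words that appear
    # in at least one reference translation. (Real BLEU clips counts
    # and combines N-gram precisions; this only shows the synonym point.)
    words = candidate.split()
    return sum(any(w in ref.split() for ref in references)
               for w in words) / len(words)

candidate = "I went for a wander in the park"
one_ref   = ["I went for a stroll in the park"]
two_refs  = one_ref + ["I took a wander in the park"]

print(unigram_hits(candidate, one_ref))   # 0.875: "wander" finds no match
print(unigram_hits(candidate, two_refs))  # 1.0: the second reference supplies it
```

With a single reference, "wander" gets no credit against "stroll"; adding a second reference that happens to use "wander" recovers the full score.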
> The complementary problem is that BLEU can give a high
> similarity score to nonsensical language which contains
> the right phrases in the wrong order...
> Now here is a possible garbled output which would get the
> very same score:
> "was being led to the calm as he was would take carry
> him seemed quite when taken"
Your statement is correct, but your example is not, if by "phrases" you mean sequences of more than one word. The reason is that your example shares (as far as I can see) no sequence of two or more words with the better sentence (which I omit here). That is, BLEU compares not only 1-grams (individual word correspondences) but also N-grams for N > 1, so your example would be penalized for having no N-grams with N > 1 in common with the better translation.
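The point is easy to verify with BLEU's modified N-gram precision. A sketch, using a made-up reference and a scrambled candidate built from the same words (since the better sentence was omitted above):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    clipped by the reference counts (BLEU's modified precision)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matches / sum(cand_ngrams.values())

reference = "the cat sat on the mat"
scrambled = "mat the on sat cat the"   # same words, garbled order

print(ngram_precision(scrambled, reference, 1))  # 1.0: every word matches
print(ngram_precision(scrambled, reference, 2))  # 0.0: no bigram matches
```

The scrambled sentence gets perfect unigram precision but zero bigram precision, and since BLEU combines the N-gram precisions (typically up to N = 4), the garbled output does not in fact receive "the very same score."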
There is a good discussion of BLEU scores on Wikipedia.