I was always curious why a system similar to SwiftKey couldn't be used for voice, i.e. looking at the words grouped together in a sentence to "guess" what the last one probably was. In many ways it seems it would be easier to improve accuracy on whole sentences than on simple one-word commands like "open", "close", or "delete", since a sentence provides context that can be used to "fill in the gaps", so to speak.
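To make that concrete, here's a rough sketch (in Python, with made-up bigram counts rather than real corpus statistics) of the kind of rescoring I mean: the acoustic side proposes several similar-sounding words, and sentence context picks the winner.

```python
# Minimal sketch: an acoustic model proposes several similar-sounding
# candidates, and bigram counts from a text corpus pick the one that
# best fits the sentence so far. All counts are hypothetical.

BIGRAM_COUNTS = {
    ("want", "to"): 9_800,
    ("want", "two"): 12,
    ("want", "too"): 7,
}

def rescore(prev_word: str, candidates: list[str]) -> str:
    """Pick the candidate most likely to follow prev_word."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

# "I want ___ go home" -- all three sound identical, context decides
print(rescore("want", ["to", "two", "too"]))  # -> "to"
```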
Since there's already a delay of a few seconds in Microsoft's system, pausing to look at the whole sentence before producing a translation wouldn't be too inconvenient. Building a library from the user's previous calls (though that raises a privacy issue) or having the user read a few pages of a book aloud (tedious but helpful) could also improve accuracy.
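The "library from previous calls" idea could be as simple as a per-user word-frequency table that breaks ties between candidates. A hypothetical sketch, where the transcripts are stand-in strings rather than any real API:

```python
# Hypothetical sketch: accumulate word frequencies from a user's past
# transcripts, then prefer the words that user actually says when
# candidates are otherwise equally plausible.

from collections import Counter

def build_user_model(transcripts: list[str]) -> Counter:
    """Count how often the user says each word."""
    counts = Counter()
    for line in transcripts:
        counts.update(line.lower().split())
    return counts

past_calls = ["book me a cab to the hotel", "what time does the cab arrive"]
user_model = build_user_model(past_calls)

# Prefer the word this user says more often
candidates = ["cab", "cap"]
print(max(candidates, key=lambda w: user_model[w]))  # -> "cab"
```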
At the very least, it would make the experience of getting a cab to your hotel in a foreign country after stepping off a 24-hour flight a bit less mental.