DOCX disaster recovery: How I rescued my wife from XM-HELL

What do you do when a critical Word document won’t open? Even in today’s world of versioned documents, it is entirely possible for corruption to squeak in and go unnoticed, wrecking your entire version history. But all is not lost. My wife had this happen to her; here’s how we solved it. Real world example In my case, Word …

COMMENTS

  1. WaveSynthBeep

    Proper version control

    It does rather make the point that Proper Version Control (you know, those things with 3-letter names) would have no trouble here, as it's decoupled from the editor in question. I suspect it's probably not as good for diffs in Word etc, but you do at least have the history going back to commit #1.

    1. Buzzword

      Re: Proper version control

      Actually it could support diffs, as long as the tool can handle zipped files. Office 2007+ files are just zipped XML, and you can diff the XML alone quite easily.
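To illustrate the "zipped XML" point, here is a minimal sketch in Python that pulls the same XML part out of two versions of a document and diffs them. The helper names (`docx_xml_lines`, `diff_docx`, `make_docx`) are hypothetical, and `make_docx` builds only a tiny stand-in archive, not a file Word would accept.

```python
import difflib
import io
import re
import zipfile


def docx_xml_lines(docx_bytes, part="word/document.xml"):
    """Return one XML part of a .docx as a list of lines, one tag per line."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_text = archive.read(part).decode("utf-8")
    # The XML is usually written as one long line; break between adjacent
    # tags so that a line-based diff has something to work with.
    return re.sub(r">\s*<", ">\n<", xml_text).splitlines()


def diff_docx(old_bytes, new_bytes, part="word/document.xml"):
    """Unified diff of the same XML part in two versions of a document."""
    return list(difflib.unified_diff(
        docx_xml_lines(old_bytes, part),
        docx_xml_lines(new_bytes, part),
        fromfile="old/" + part,
        tofile="new/" + part,
        lineterm="",
    ))


def make_docx(body_text):
    """Build a tiny stand-in archive (illustration only, not a valid DOCX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:t>%s</w:t></w:p></w:body></w:document>"
            % body_text,
        )
    return buffer.getvalue()


if __name__ == "__main__":
    for line in diff_docx(make_docx("first draft"), make_docx("second draft")):
        print(line)
```

A version control system could run something along these lines as a diff filter, so the history stays binary but the comparison is done on text.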

    2. Anonymous Coward

      Re: Proper version control

      I'm not sure a version control system would resolve the issue.

      As I understand it, the problem is the editor is saving garbage to a file but the document looks OK as long as the document is not closed and reopened.

      If your editor is writing garbage to a file or a version management system, you'll still need to recover any work between the last usable document and the current document, and assuming sufficient edits/time you just have a better organised mess.

      If the XML file were parsed to check for problems after each save, and a sensible reconcile between the saved and active document occurred, this should solve the majority of the mismatched-tag issues (assuming there isn't some common fault between the editor and file parser - I'm sure that would never happen.....)
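The "parse it after every save" idea can be sketched outside the editor too. The following is a minimal Python example, assuming the saved file is a ZIP of XML parts; `find_malformed_parts` and `build_archive` are hypothetical names, and the sample parts are stand-ins rather than real WordprocessingML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_malformed_parts(docx_bytes):
    """Return {part name: parser message} for every XML part that fails to parse."""
    problems = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary parts
            try:
                ET.fromstring(archive.read(name))
            except ET.ParseError as error:
                problems[name] = str(error)
    return problems


def build_archive(parts):
    """Zip a {name: xml string} mapping into bytes (stand-in for a saved file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in parts.items():
            archive.writestr(name, text)
    return buffer.getvalue()


if __name__ == "__main__":
    saved = build_archive({
        "word/styles.xml": "<styles><style/></styles>",
        # An element opened and never closed, as described in the article.
        "word/document.xml": "<document><hyperlink><p>text</p></document>",
    })
    for part, message in find_malformed_parts(saved).items():
        print(part, "->", message)
```

This only checks well-formedness; it would not catch a file that is valid XML but violates the document schema.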

      1. WaveSynthBeep

        Re: Proper version control

        It wouldn't necessarily solve it, but it might help. The main issue is relying on a tool as both editor and version manager. If it screws up, you've lost your version history. That's an eggs-in-one-basket risk avoided by using an external tool.

        If you have an external VCS you've at least got guaranteed access to everything in the history. Some of those might be corrupt, but there will be a known-good version. You can diff the last known-good against the first corrupt to see what changed. It may or may not be straightforward to port that change forward into the latest version.

        Plus you can see what you're changing - if the editor decides to reduce the file size to zero bytes, diff will show you that before you commit.

        The main headache is that version control systems don't always play nicely with binary files, but as mentioned there may be a plugin to support zipped files, which would help in this case.

        1. Roland6 Silver badge

          Re: Proper version control

          Version control only helps with completed works, it falls down when dealing with errors.

          What irritates me about the majority of office tools is the lack of proper journalling: diff will tell you what the byte-level differences are, whereas a transaction journal will tell you that the differences are down to the section numbering changing.

    3. asdf
      Trollface

      Re: Proper version control

      >but you do at least have the history going back to commit #1.

      Unless you work where I do where we unnecessarily change version control systems/servers so often that it can be hard to look up history past the last six months. Import/Exporting history doesn't seem to be a priority. Oh well not my company.

      1. Trevor_Pott Gold badge

        Re: Proper version control

        If we had used Dropbox instead of Sync for version control we wouldn't have version history to commit #1. It would have eventually wrapped as it jettisoned the older versions.

        I agree that a proper version control system is a really good idea...but very few people have them.

    4. Anonymous Coward

      Re: Proper version control

      "The wife was using an old version of LibreOffice Writer"

      That's your problem right there....You get what you pay for.

      1. Trevor_Pott Gold badge

        Re: Proper version control

        Prove it.

    5. Version 1.0 Silver badge

      WTF

      Odd - I've been using WordPerfect for years (like 20 years now I think - I started under DOS and VMS) and I've NEVER had a file become unreadable in all this time.

      Am I just lucky, or is MS Office simply the application from hell?

      1. Diogenes

        Re: WTF

        I've had MS in all incarnations except 2010 (don't have 2013) issue a message totally unrelated to IO or the actual task being done (e.g. I am trying to redefine a style, and it tells me I can't insert an image or some such), die, and the file I was working on disappear as if it never existed. It happened rarely, no more than once a year, so I now close and copy the doc every 2 or 3 pages or so.

        I remember opening WP 5.0 & 5.1 files in SPF-PC to fix some spectacularly garbled tag fest (usually my fault - badly coded macro)

      2. Anonymous Coward

        Re: WTF

        According to the article, MS Office didn't corrupt the file, LibreOffice did. MS Office just refused to open a corrupt file.

  2. Anonymous Coward

    Done something similar

    Back in 2003, I can't remember the full circumstances but I think I managed to hose X, so I was stuck at a command prompt with no GUI, and needing to know where my next lecture would be from an OpenOffice Calc file.

    (Yes, Linux laptop. I was too poor to afford anything more than a Pentium II and no one in their right mind would touch Windows 98.)

    I recall unzipping it and then doing a grep for the approximate text. I was then able to open a text editor on the file, look for the text, and found the information I was after.
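The unzip-then-grep trick can be sketched in a few lines of Python for anyone without a shell handy. This is an assumption-laden illustration: `grep_archive` and `build_archive` are hypothetical names, and the sample `content.xml` is a made-up stand-in for a spreadsheet part.

```python
import io
import re
import zipfile


def grep_archive(archive_bytes, needle, context=40):
    """Yield (member name, snippet) for each case-insensitive hit in the XML parts."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            raw = archive.read(name).decode("utf-8", errors="replace")
            # Replace tags with spaces so only the visible text is searched.
            text = re.sub(r"<[^>]+>", " ", raw)
            text = re.sub(r"\s+", " ", text)
            for match in pattern.finditer(text):
                start = max(match.start() - context, 0)
                yield name, text[start:match.end() + context].strip()


def build_archive(parts):
    """Zip a {name: xml string} mapping into bytes (stand-in for a real file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in parts.items():
            archive.writestr(name, text)
    return buffer.getvalue()


if __name__ == "__main__":
    timetable = build_archive({
        "content.xml": "<table><row><cell>Monday 10:00</cell>"
                       "<cell>Compilers lecture</cell>"
                       "<cell>Room B12</cell></row></table>",
    })
    for member, snippet in grep_archive(timetable, "lecture"):
        print(member, ":", snippet)
```

Stripping tags with a regular expression is crude, but for a one-off "where is my next lecture" search it is usually enough.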

  3. NoneSuch Silver badge

    Which is why...

    I save my Word docs as RTF.

    1. Malcolm 1

      Re: Which is why...

      How does that help? Is it somehow immune to errors?

      My experience with editing RTF the hard way is that it is even less amenable to fixing than XML (not least because the tooling is far less developed).

    2. Anonymous Coward

      Re: Which is why...

      "I save my Word docs as RTF."

      At which point you've lost so much functionality, you might as well use Wordpad.

      1. Anonymous Coward

        Re: Which is why...

        "At which point you've lost so much functionality, you might as well use Wordpad."

        Which for almost all the documents I see a word processor used for would be perfectly fine. RTF is actually a very nice format - easily parsed, human readable, compresses well and quite feature rich.

        1. Kubla Cant

          Re: Which is why...

          "RTF is actually a very nice format - easily parsed, human readable"

          My recollection of hand-fixing RTF files is that they were only readable in the sense that they consisted of printable ASCII characters. Understanding the RTF was another matter entirely. Maybe I'm subhuman.

          The problem, I suspect, may be that the RTF emitted by Word suffers from the same lack of structure as the HTML emitted by Word. Editing a Word-generated HTML document isn't a pleasant experience.

        2. Michael Wojcik Silver badge

          Re: Which is why...

          If RTF is "very nice", then nearly every other markup language ever invented must be at least "deliriously wonderful".

          I save most of my writing as LaTeX. That also has the advantage of not being editable by Microsoft Word.

          Personally, my feeling is that Microsoft Word is somewhat worse than OpenOffice / LibreOffice for editing other people's Word documents, or creating Word documents when I absolutely must; and unsuitable for any other purpose. (Powerpoint still had the edge over OO/LO Impress the last I checked, and for some kinds of presentations remains more suitable than the various LaTeX and HTML alternatives. Excel, on the other hand, I loathe with every fiber of my being.)

          1. Anonymous Coward

            Re: Which is why...

            "Powerpoint still had the edge over OO/LO Impress the last I checked, and for some kinds of presentations remains more suitable than the various LaTeX and HTML alternatives. Excel, on the other hand, I loathe with every fiber of my being"

            To be honest, I originally found Powerpoint more usable than OO's/LO's equivalent until its UI got destroyed or "ribbonised", which *seriously* got in the way of usability, together with the absolutely stupid load of gadgetry that MS tends to use to foul up any usable application (RIP, Visio).

            However, it was at that time we switched to OSX, and Keynote is just *so* much better that we pretty much abandoned Powerpoint completely, not least because it somehow promotes austerity in slides, which only benefits the quality. Most of the time we use HaikuDecks or Powtoons instead, but if we have to work on slides internally, Keynote is what we use, and export to PDF later.

            AFAIK there are no real issues with Excel or LO's equivalent, but we're not a finance house so we don't use all the functions that may make a difference. We do, however, have a hangover from the past which stopped us early from using Excel: we work in multiple languages, and an early Excel spreadsheet in English would not work in, for instance, German because the formulas were not tokenised (no, really, I'm serious, for instance "SUM" in English would fail in a German version of Excel which required that to be "SUMME"). If you're shaking your head by now, so did we; I still can't quite believe it. This got us into using OpenOffice pretty early, even before the days when LibreOffice was forked.

            Resuming the original topic, I actually never had a format failure from OpenOffice other than when the first file format change appeared. As someone explained to me, I must have been lucky :)

      2. Anonymous Coward

        Re: Which is why...

        "At which point you've lost so much functionality, you might as well use Wordpad."

        Or Google Apps / Libre Office. Microsoft Office is still miles ahead of the alternatives.

        1. Trevor_Pott Gold badge

          Re: Which is why...

          "Microsoft Office is still miles ahead of the alternatives."

          List the 'features' in Microsoft Office that I, personally, my family or my clients care about that are available in Microsoft Office that are not available in LibreOffice or Google Apps. Present a solid commercial rationale for why these features are worth the price delta on a per user basis.

          Please include an analysis of the "value" of my data being made available to the NSA/GCHQ/etc on demand so that they can scan it in order to send innocents to jail and/or steal whatever innovations I may have to give their own companies commercial advantage. Please include an exacting means by which I can ensure that closed source software - let alone American cloud-integrated stuff - is free of such snooping, should I choose not to avail myself of the "feature" of governmental integration.

          If you cannot provide a credible analysis of exactly why and how Microsoft Office provides a better value than the competition, in real dollars and cents for features that I actually care about then I have only two conclusions to draw:

          1) Your absolutist statements are false because they do not apply to everyone.

          2) You are completely and utterly full of shit.

          Please note that the two conclusions are not mutually exclusive.

          1. Anonymous Coward

            Re: Which is why...

            "Microsoft Office is still miles ahead of the alternatives."

            "List the 'features' in Microsoft Office that I, personally, my family or my clients care about that are available in Microsoft Office that are not available in LibreOffice or Google Apps. Present a solid commercial rationale for why these features are worth the price delta on a per user basis."

            IMHO none whatsoever, but Word does have ONE (1, uno) feature that I would dearly love to see in LibreOffice: shift-F5 cursor position replay. It means you can zip to another part of a document (using, for instance, Document Map, which LibreOffice's Navigator knocks into a cocked hat with consummate ease), do some editing and then use Shift-F5 until you're back where you came from. For work on larger docs it's very helpful.

            However, as that is about the only feature I can recall that LibreOffice lacked vs Word after a good 20+ years of working on docs I reckon I can live with that (and I simply changed approach to compensate - I just open another window on the same document). Personally, I would love to see a Reveal Codes as it existed in Wordperfect.

            I default to LibreOffice with pleasure. It certainly saves us money, but that's less an issue than that it saves me staff retraining (as its UI is stable, and resembles the Office 2003 layout, pre-rubbish, sorry, ribbon), works identically across platforms and doesn't expose us to license compliance risks. God knows how much time we must have saved not having to worry about licensing.

        2. Anonymous Coward

          Re: Which is why...

          Miles ahead?

          Maybe. But not necessarily in a direction I'd particularly like to go (aka "Yesterday, we were looking into the abyss. Today, we took a great step ahead.")

          The ribbon UI still manages to confuse me after a couple of years of use, and today's wide-and-low 16:9 screens do not take kindly to having a HUGE ribbon bar at the top. Why, o why can't we at least move that to a column on the side, where screen real estate is not nearly as precious? I resort to having the ribbon auto-hide, but that is still not good UI design.

          The rather obnoxious fixation on "persuading" users to store their stuff on Microsoft's servers (Office 2013 got really pushy about that) is also not to my liking. One year after Snowden, there should at least be an easy to find, global switch (controllable by group policy) to once and for all disable the cloud functionality should a user elect to keep a modicum of privacy.

          Unlike freebies, where the user's data is the currency paid for the privilege of using the service, Office is a product the user pays for with their own money, so the business model would not require revenue from data mining.

    3. Anonymous Coward

      Re: Which is why...

      "I save my Word docs as RTF."

      I save mine in ODF format, or (if I absolutely have to use an MS format) .doc. I tend to avoid the X formats wherever I can, and I thus never had the problems as described (or I've been lucky).

      I have one single copy of MS Office in the whole office, and that's only for compatibility reasons - internally we switched to LibreOffice. That's the benefit of being small :).

      1. Ken Hagan Gold badge

        Re: Which is why...

        "I tend to avoid the X formats wherever I can, and I thus never had the problems as described (or I've been lucky)."

        You've been lucky. The ODF formats are also "zipped XML" and so if the code is willing to emit an ill-formed XML document then it seems perfectly plausible that it would be willing to spit out bad ODF as readily as bad DOCX.

        1. Trevor_Pott Gold badge

          Re: Which is why...

          Given "how" this occurred, I'm 100% positive that LibreOffice Writer 4.1 would have caused the same error (with the same fix) for ODF as it would have for OOXML. I'm less sure that many of the oMath or Excel errors in Word would have been the same in ODF as under OOXML, as they are most frequently issues with "order of tags" rather than "when the tags are committed". Thus it is theoretically possible that Word could write an oMath error to DOCX but not to ODF.

          Either way, I'm now glad that both formats exist as they do; in a human-fixable fashion.

  4. Suricou Raven

    Unusual error

    By far the most common corruption I've found is in the form of truncation - people yanking out their USB sticks before the file is fully written. The ZIP container stores the index at the end of the file, so you can't even open it unless it's all there. After finding every so-called document and zip recovery utility quite useless, I just wrote my own one. It'll let you recover at least partial contents of the zip, hopefully including the document.xml from which raw, unformatted text can be easily extracted.

    Then I just pass it back to the user and tell them to fix the layout and unmount properly next time.

    https://birds-are-nice.me/programming/zipfilerecover.shtml - if anyone ever needs it.
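For readers curious how recovery without the end-of-file index can work at all, here is a minimal Python sketch of one approach: scan for local file headers and inflate whatever data follows each one. This is not the linked tool and makes no claim to match its method; `salvage_members` and `build_archive` are hypothetical names, and the sketch ignores encryption, ZIP64 and stored entries written with data descriptors.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER = struct.Struct("<4s5H3L2H")  # fixed 30-byte local file header
SIGNATURE = b"PK\x03\x04"


def salvage_members(data):
    """Return {name: bytes} for whatever is readable without the central directory."""
    recovered = {}
    offset = data.find(SIGNATURE)
    while offset != -1 and offset + LOCAL_HEADER.size <= len(data):
        (_, _, _, method, _, _, _, packed_size, _,
         name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
        name_start = offset + LOCAL_HEADER.size
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        body = name_start + name_len + extra_len
        next_search = offset + 4
        if method == 8:  # deflate: inflate until the stream or the file ends
            inflater = zlib.decompressobj(-15)  # -15 selects raw deflate
            chunk = data[body:]
            try:
                recovered[name] = inflater.decompress(chunk)
                next_search = body + len(chunk) - len(inflater.unused_data)
            except zlib.error:
                pass  # damaged beyond what this sketch handles
        elif method == 0:  # stored: bytes are copied verbatim
            recovered[name] = data[body:body + packed_size]
            next_search = body + packed_size
        offset = data.find(SIGNATURE, next_search)
    return recovered


def build_archive(parts):
    """Zip a {name: bytes} mapping into bytes (stand-in for a real document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in parts.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


if __name__ == "__main__":
    text = "".join("<p>paragraph %d</p>" % i for i in range(2000)).encode()
    whole = build_archive({"[Content_Types].xml": b"<Types/>",
                           "word/document.xml": text})
    truncated = whole[:len(whole) // 2]  # simulate a yanked USB stick
    for name, payload in salvage_members(truncated).items():
        print(name, len(payload), "bytes recovered")
```

A member cut off mid-stream comes back as a prefix of its original contents, which is often enough to rescue the raw text.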

    1. Roland6 Silver badge

      Re: Unusual error

      Interesting tool. I suspect that this is the type of tool the author of the article was really in need of: namely one that could handle office document XML with errors in a sensible and user friendly way.

      Suspect your tool will become even more useful as documents move into the cloud: a truncated network connection also plays havoc with the readability of office documents.

      1. Suricou Raven

        Re: Unusual error

        That thing has nothing to do with XML at all. It just extracts truncated ZIP files. When it comes to actually making some sense of the half-a-document you get from it, you're on your own! In many cases, even recovering nothing but the textual content of a document is still valuable.

        1. Roland6 Silver badge

          Re: Unusual error @Suricou Raven

          Sorry wasn't clear.

          The tool you created (and helpfully linked to, thank you) understands the Zip file format and so whilst it doesn't repair the zip file, it is a big step forward in recovering the contents of a damaged zip file.

          I was suggesting that a tool that understood office XML could 'recover' an office document so that it could be loaded into say Word, leaving the user to make sense of what was recovered.

    2. Ron007

      Re: Unusual error

      Yes, USB "truncation" is probably much more common than XML Tag errors. But even worse than storing the index at the end of the ZIP is that the Word X format files store the body text "document.xml" at the end of the file. So very commonly you lose most if not all of the text content.

  5. Ugotta B. Kiddingme

    With apologies to Ray Parker, Jr.

    "When there's something strange in your closing tag, who you gonna call?"

    "CodeBusters!"

    1. FartingHippo
      Stop

      Re: With apologies to Ray Parker, Jr.

      That's deep in the uncanny valley of humour.

      So bad it's not funny, but not quite bad enough to be funny for the badness itself.

  6. jzlondon

    What if the corruption had been more than just a missing tag?

    A sensible policy would be Dropbox and a five-minute auto save. That way, you can go back through the file's history and get the last saved version.

    Also, if your content is really important to you, use actual Microsoft Word. Yes, yes, I know. It doesn't run on Linux, it isn't free, it locks you in, etc. But it's bloody well tested and at the end of the day it's your content on the line.

    1. Anonymous Coward

      MS Word is not immune to this either

      I have PhD students compiling documents with tables and images from all types of sources and very often Word throws a wobbly and loses something, refuses to open or stops working with some add-on from a third party reference manager.

      MS Word is not perfect.

      1. Anonymous Coward

        Re: MS Word is not immune to this either

        "I have PhD students compiling documents with tables and images from all types of sources and very often Word throws a wobbly and loses something,"

        Then insist they use a real document formatting system like LaTeX, rather than a piss poor desktop publishing application.

      2. AceRimmer
        Holmes

        Re: MS Word is not immune to this either

        "or stops working with some add-on from a third party reference manager.

        MS Word is not perfect."

        I think the add-on might be the culprit here

        1. Marshalltown

          Culprit

          The culprit is Microsoft and its historical Godzilla-eats-world approach to software. If the composition and finalization of documents in Word (or OO for that matter) followed a rational work flow scheme, these errors would be far fewer and fixes would be immensely easier. As is, you get software that assumes that it can do no wrong, and can do anything you need. Commonly the assumptions are wrong on both counts.

          1. Stevie

            Re: Culprit

            How does the approach taken to finalizing a document in MS Word guide the hand of the better, more conscientious people who wrote OfficeLibre Write, which is the actual culprit in *this* scenario? Are you saying that after all the fuss of "don't use crappy Microsoft, use our better alternative", the design thinking is identical?

            1. Trevor_Pott Gold badge

              Re: Culprit

              Wow. Jesus shit-pickling Christ, will you please get your head out of your very biased ass? Where did I - in the article or the comments - say "don't use Microsoft, use LibreOffice"?

              I said "this is a class of error that at a minimum Word and Writer can and do both cause. In my case, the error was caused by Writer, but anything that writes to a DOCX can theoretically cause it, here's how you fix it."

              Saying "Word does this too" is not saying "LibreOffice is better". They are distinct concepts. Saying "Word does this too" is to reinforce that this is an issue that can occur with the document format, and that any application could theoretically cause it. Or a single-byte corruption error could cause it. Or $deity knows what else.

              The point is that "which productivity suite is 'better'" has absolutely no place in the discussion at all. It doesn't matter. Since both Word and Writer have been proven to cause this class of error then knowing about the class of error and how to solve it are what matter.

              Take your religious issues elsewhere.

              1. jzlondon

                Re: Culprit

                Although in your particular case, the bad formatting would appear to be caused by a bug in LibreOffice, rather than being a stand-alone issue relating to generic file I/O errors.

              2. Stevie

                Re: Culprit

                Take a breath, Trevor. I was responding to the post immediately before mine written by "Marshalltown", not to you. I replied to his post but the indent seems not to have happened.

                As for my "bias", it was induced by your article and I wasn't the only person to take away the message I did - suggesting to anyone not in complete turtle defense mode that perhaps your own prose is working against you.

                Indeed, I've tried to write my questions to you in a neutral tone throughout and have been clear I'm trying to understand the issue, not pick on you - with perhaps the sole exception of one comment pointing out that not having documents open for weeks on end has been common wisdom trotted out in these pages for years in response to "the SA booted my machine and I lost days of work" posts. All that following good practice would have done was alert you to the issue sooner of course.

                If you wanted to convey the same sort of neutral tone in your article it was, in my opinion, unwise to illustrate a problem you saw with NOT (MS Office) with examples drawn from the library of problems of the Big Bad Bugger on Campus. *That* is the source of this particular misunderstanding and it is entirely down to an editorial choice made by your good self.

                As for my "religious issues", I already said I use OpenOffice myself. In fact I don't possess a copy of MS Office older than '97 for my personal use. I have to use MS Office at work.

                1. Trevor_Pott Gold badge

                  Re: Culprit

                  With this lot, neutrality doesn't matter. If you aren't fellating Microsoft you're absolutely against them. There is a pack of absolutely rabid anti-open-source types that occupy the comments, and I'm sorry if I inaccurately lumped you in with them. I think it's fairly easy to understand why I did.

                  Really, however, it's this comment that does it: "But why go to so much trouble trying to pin this (in the reader's mind's eye) on MS - by the headline which strongly suggests the problem will lie with yet another DOCX issue and by 'the most famous example' which is still pretty obscure to be honest - when what you are really up against is an OfficeLibre Write bug?

                  10 points for style but minus a couple of hundred for mendacity."

                  You outright accuse me of lying by somehow attempting to "pin this" on Microsoft. What the fuck? The article in no way attempts to "pin this" on Microsoft. There's absolutely nothing in that article at all that says "Microsoft Word is bad" or "LibreOffice is better". I mention - in the article and in the comments - that Word and LibreOffice can both give rise to errors where you might care about this kind of fix...and I go to some length to discuss the different ways it can occur, with examples using each product.

                  It doesn't really get much more neutral than that. Yes, the error I personally experienced was with LibreOffice writer, but that is completely irrelevant; the error class can be caused by multiple products, and thus mentioning that - with examples - is in the public good.

                  Yet you come out and accuse me of lying to people and somehow trying to "blame" Microsoft. So yeah, you know what? You get lumped in with the batshit-crazy Anonymous Coward and LDS as "rabidly and irrationally pro-Microsoft", to the point where I can't - and won't - take anything you have to say seriously. There's no neutrality or objectivity present in what you said there, there's a massive assumption followed by an attack.

                  The majority of people who read this article didn't walk away with a "Trevor was trying to blame this on Microsoft" vibe in any way shape or form. Some folks, however, see monsters where none exist. I've no time, patience, or respect for them.

                  I don't feel the need to "imply" Microsoft - or anyone else - is at fault for things. If I think Microsoft fucked up, I say so openly. If I think LibreOffice is better and you should buy that, then I say so. There's no pussyfooting around.

                  Comments like "Are you saying that after all the fuss of 'don't use crappy Microsoft, use our better alternative'", responding to an article in which I did not in any way shape or form recommend one product over the other, would seem to indicate that you fall into the "you didn't fellate Microsoft, you're obviously an evil, open source economy destroying wretch" camp.

                  So no, sir, I don't accept your "neutral tone" argument. You waltzed in here and accused me of lying. When I said "bullshit", you doubled down. Maybe you aren't rabidly pro-Microsoft, but your presentation was in no way a "neutral tone". If you come in guns blazing, don't get all shocked and shaken if'n I fire back.

    2. Trevor_Pott Gold badge

      Dropbox only keeps so many versions in history. What do you do if Word introduces the error on page 2, autosaves every minute, but you don't close Word until page 32? Every single save in the Dropbox history would have the error.

      Remember: these errors can be introduced, but go unnoticed until you close the application and attempt to re-open the document.

      Besides, if your last good version was "two pages of text" you are going to want that last 30 pages!

      1. Don Jefe

        The easy solution is to allow Windows to automatically install every last update as soon as they roll out. This way your computer is constantly restarting so you'll never get too far into writing your document :)

        1. Fatman
          Coat

          RE: allow Windows to automatically install every last update

          This way your computer is constantly crashing and restarting so you'll never get too far into writing your document :)

          FTFY!!!

      2. jzlondon

        Dropbox keeps an unlimited number of versions for up to a month, and it keeps a completely unlimited number of versions if you pay a bit extra for "pack rat". I agree that an error could be introduced, but it's still a better safeguard than relying on a plain old filing system.

  7. chivo243 Silver badge

    Lot of work?

    Seems like a lot of work.... First, I would have tried to open it on a Mac with Text Edit, and, if successful, converted to plain text, then back to an office document flavor of your choice. I can’t begin to count the times Text Edit worked the charm. If no success, then it would have been the lot of work route.

    1. jzlondon

      Re: Lot of work?

      TextEdit will never open a .docx file in a readable form - it's not like an old .doc file, as it's actually a zip archive containing a bunch of other files. Will always look unreadable if consumed neat.

  8. Vic

    xmllint is pretty good for finding broken xml...

    [vic@perridge ~]$ echo "<root><tag1></broken></root" | xmllint -

    -:1: parser error : Opening and ending tag mismatch: tag1 line 1 and broken

    <root><tag1></broken></root

    ^

    -:2: parser error : expected '>'

    ^

    Vic.

    1. Charlie Clark Silver badge
      Coat

      Re: xmllint is pretty good for finding broken xml...

      Specifically for the MS files, the Office OpenXML SDK Productivity Tool is to be recommended. It can open most archives as long as the [Content_Types].xml is intact - it's fussy about the namespace in this file. It includes validation and comparison tools and will automatically reflow the XML.

      Otherwise simply unpack the archive with unzip, run suspect files through tidy and open them in your editor of choice.

      unzip -d xml fucked_file.docx

      tidy -m -xml xml/word/document.xml

      To be fair to the LibreOffice developers, OOXML is a shit format. It's difficult to get right partly because it's so fucking verbose. The specification is thousands of pages long and even then vague. However, the LibreOffice developers also have a history of releasing poor code. I've replaced it with the more conservative OpenOffice because it crashes too much on my Mac.

      Mine's the one with the ECMA-376 specification in the pockets, ta.

    2. Michael Wojcik Silver badge

      Re: xmllint is pretty good for finding broken xml...

      Thanks. I was wondering how far I'd get into the comments before someone pointed out that there are tools specifically for identifying malformed XML. That's part of the point of the language - redundancy to make error detection and recovery easier.

      For that matter, it'd be trivial to take one of the streaming XML parsing APIs (e.g. any SAX implementation) and write your own program that simply checked for closing a tag that's not at the top of the stack, or for a malformed tag sequence. A typical Windows programmer should be able to write one using, say, Xerces or the .NET XML classes in an hour or two. There's something to be said for repurposing existing tools - Trevor's mucking about with browsers made me a bit nostalgic for hacking on 8- and 16-bitters where we often didn't have proper tools - but these days we have an embarrassment of technical resources at our fingertips.

      All's well that ends well, I suppose; but I think the real lesson here is "get a copy of xmllint".
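For anyone who wants to try the "write your own checker" idea, here is a minimal sketch in Python rather than Xerces or .NET. It assumes the suspect part (e.g. word/document.xml) has already been unpacked and read into memory as bytes; the function name `find_first_xml_error` is made up for illustration. It uses the standard library's expat bindings to keep a stack of open elements and reports where the parser first gives up, along with whatever was still open at that point.

```python
import xml.parsers.expat


def find_first_xml_error(data: bytes):
    """Return None if data is well-formed, else a dict describing the first error."""
    open_tags = []  # stack of (element name, line where it was opened)
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        open_tags.append((name, parser.CurrentLineNumber))

    def on_end(name):
        # expat only calls this for an end tag that matches the current element
        open_tags.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return {
            "message": xml.parsers.expat.ErrorString(err.code),
            "line": err.lineno,
            "column": err.offset,
            # the last entry is the innermost element still open at the error
            "open_tags": list(open_tags),
        }
    return None
```

With a hanging tag like the one in the article, the last entry of `open_tags` should be the element that never got closed, which is usually more helpful than a bare "error on line 2".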

      1. Roland6 Silver badge

        Re: xmllint is pretty good for finding broken xml...

        >but I think the real lesson here is "get a copy of xmllint".

        If Trevor hasn't deleted all the corrupted versions of the document, it would be interesting to see what assistance xmllint provides in resolving the problem.

  9. Robert Harrison

    Which Office product is at fault?

    Trevor, from the 1st para it feels like you're blaming MS Office, where in fact LibreOffice is the culprit behind the original issue:

    "...Word wouldn’t open an important file, dying instead with the error “the name in the end tag of the element must match the element type in the start tag”. Translated from Microsoftese: “The word processor that created this document made an XML boo-boo, and Word is going to refuse to read this document now...”

    "...The wife was using an old version of LibreOffice Writer (v4.1) and had made several changes to hyperlinks in one area of the document. Writer got confused somehow, opened a hyperlink tag, but didn’t actually put in any information as to where it was hyperlinking to, and didn’t close the tag."

    I'm trying not to be biased here, MS Office works-for-me (tm) but I can't argue with LibreOffice's price tag :-)

    1. Anonymous Coward
      Anonymous Coward

      Re: Which Office product is at fault?

      I'm not sure that matters, as I have seen both screw up on complex documents. Sometimes LibreOffice will open one that Word refuses, but when they go wrong you always seem to lose something.

      As other commentards have said, best to save and re-open frequently, and with differing file names or with a versioning system in place. Puts a neat limit on where you lost the data (and/or will to live).

    2. Shoot Them Later
      Windows

      Re: Which Office product is at fault?

      LibreOffice is horribly at fault here, and it's inexcusable if the problem is as described. I mean, how does it construct its XML - string concatenation or something?? Any half decent programmer uses a DOM object of some sort to manipulate XML documents - these will not let you add an opening tag and 'forget' to close it; adding an element should be an atomic operation.

      Aside from all that, what's to stop LibreOffice *parsing* the document before saving it, just to be sure. Maybe, I don't know, using some kind of published schema?

      No excuse. Everything has bugs now and then, but committing unparseable XML to disk should not be one of them.
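To illustrate the point about tree-based construction, here is a rough sketch using Python's ElementTree. The element and attribute names (`hyperlink`, `target`) and the helper functions are invented for the example and are not real OOXML or anything from LibreOffice's code. Adding the link is a single operation on the tree, cancelling removes the whole node, and the serialised bytes are parsed back as a sanity check before anything would be written.

```python
import xml.etree.ElementTree as ET


def add_hyperlink(paragraph, text, target):
    """Attach a complete hyperlink element to the paragraph in one step."""
    link = ET.SubElement(paragraph, "hyperlink", {"target": target})
    link.text = text
    return link


def cancel_hyperlink(paragraph, link):
    """Undo the insertion by removing the whole node, never half of it."""
    paragraph.remove(link)


def serialize(root):
    """Serialise the tree and re-parse the result as a well-formedness check."""
    data = ET.tostring(root, encoding="utf-8")
    ET.fromstring(data)  # raises ET.ParseError if the output is not well-formed
    return data
```

Because the serialiser walks a tree, there is no way to emit an opening tag without its matching close. Whether that is fast enough for a large document is a separate argument.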

      1. Trevor_Pott Gold badge

        Re: Which Office product is at fault?

        "No excuse. Everything has bugs now and then, but committing unparseable XML to disk should not be one of them."

        Word does it too. A link from the article goes here: http://support.microsoft.com/kb/2528942/en-us which is one of the more famous examples of Word doing this exact same thing in a subtly different way.

        1. Anonymous Coward
          Anonymous Coward

          Re: Which Office product is at fault?

          Which looks to be fixed since Office 2010 SP1....

          1. Trevor_Pott Gold badge

            Re: Which Office product is at fault?

            "Which looks to be fixed since Office 2010 SP1"

            what does that have to do with anything? The fault existed. Similar ones have existed in the past. More will exist in the future. Microsoft doesn't get a free pass because they fixed a fault. They're just as much of a risk for this type of issue as anyone else. Besides, I know of at least three others in Excel that can cause a similar XLSX corruption that, to my knowledge, still affect 2013.

        2. Ron007

          Re: Which Office product is at fault?

          Yes, that is an example of a tag error that has been fixed.

          Even after that fix, there is still at least one remaining tag error. It appears to happen when you create an equation, exit the equation editor, then later edit the equation. Rather than editing an existing equation, it is safer to simply recreate it from scratch, making the needed correction.

      2. Ken Hagan Gold badge

        Re: Which Office product is at fault?

        "Any half decent programmer uses a DOM object of some sort to manipulate XML documents"

        More generally, almost any conversion from one structured format to another is an example of parsing (whether from file to memory, vice versa, or neither). Parsing was analysed to death in the 60s, and there are today several excellent parser generators available to anyone who is faced with a significant input parsing or output formatting problem.

        I'm as guilty as the next programmer, but as a community it is quite shocking that we've known how to automate that part of programming for 40 years and yet still prefer to cobble together something that reminds us of our first text books.

        1. Trevor_Pott Gold badge

          Re: Which Office product is at fault?

          I'm guilty too. Many's the time I've written a parser from scratch because the documentation for the existing ones in the language I was using was dense enough that I felt trying to grok it would take 3x as long as just writing the thing myself...

      3. Charlie Clark Silver badge
        Thumb Down

        Re: Which Office product is at fault?

        Exactly, what you don't want to do when serialising OOXML is create a DOM as DOMs use a lot of memory and are very slow. So string concatenation of one form or another is the preferred approach. This can still be wrapped in functions to ensure that tags are well-formed but that doesn't really help much, there are still a lot of things that can go wrong, and even more that the consuming application can complain about viz. the different behaviour of Word and LibreOffice to the broken file.

        Unfortunately, the OOXML developers forgot to learn the lessons of HTML and include a section on error handling.
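As a sketch of what "string concatenation wrapped in functions" might look like, here is a small streaming writer in Python. The class name `StreamingXmlWriter` and the element names are hypothetical and this is not how any real office suite does it. The idea is only that the function which writes the opening tag is also the one responsible for writing the closing tag, so the two cannot drift apart.

```python
import io
from contextlib import contextmanager
from xml.sax.saxutils import escape, quoteattr


class StreamingXmlWriter:
    """Writes XML straight to a stream without building a tree in memory."""

    def __init__(self, out):
        self.out = out

    @contextmanager
    def element(self, name, **attrs):
        attr_text = "".join(
            f" {key}={quoteattr(value)}" for key, value in attrs.items()
        )
        self.out.write(f"<{name}{attr_text}>")
        try:
            yield self
        finally:
            # the close is emitted even if the body raises, so tags stay balanced
            self.out.write(f"</{name}>")

    def text(self, value):
        self.out.write(escape(value))


def build_example():
    buffer = io.StringIO()
    writer = StreamingXmlWriter(buffer)
    with writer.element("doc"):
        with writer.element("link", target="http://example.com/?a=1&b=2"):
            writer.text("R&D <notes>")
    return buffer.getvalue()
```

This keeps memory use flat, but as noted above it only guarantees balanced tags and escaped text. It says nothing about whether the consuming application will accept the result.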

        1. Shoot Them Later

          Re: Which Office product is at fault?

          Fine, Charlie Clark, I understand your reasoning for eschewing DOM - and let's face it, we've all built XML documents from strings from time to time. But that's not really my point - the case here is where a document has been written to disk that is not even well formed, let alone valid. Are people (and I'm looking at you, anonymous downvoters) really saying that it's too hard to check that the XML document that you've built is well formed at least before writing it? Really? This doesn't require DOM to do - there are faster ways to check that a document is valid and/or well formed.

          Maybe I'm somehow unusual in this, but I would far prefer my office software to warn me that there was a problem (and give me the opportunity to cut and paste my work to a fresh document, or make a backup of the existing file) than to silently write an invalid file to disk.

          I still maintain - and it shocks me that this seems to be somehow a controversial view - that there is no real excuse for a piece of released end user software writing XML files that are not even well formed.

          Full disclosure: despite from time to time being obliged to interact with Lotus Symphony - which I have always found to be an unpleasant and unrewarding experience - I am by no means a hater of LibreOffice. I am also no lover of Microsoft (particularly Word for Mac 2011 with its fun habit of crashing when I try and paste formatted text).

    3. Trevor_Pott Gold badge

      Re: Which Office product is at fault?

      I have had both Word and Writer cause the issue in subtly different ways. A little bit of research on the internet shows me that every single word processor that I can think of which handles DOCX has (or had) at least one bug that can cause this error.

      Thus the error is not restricted to any one product but is in fact a common style of error relating to how badly people write XML parsers.

      1. Shoot Them Later

        Re: Which Office product is at fault?

        There is no real excuse for writing malformed XML to disk, bug or not. The only exception I'll give is an error in the actual writing process (such as unexpected truncation). There are plenty of good XML libraries out there which will (should) not screw up as described if used properly. It really is not hard at all to check that an XML document is both valid and well formed before writing it to disk, passing it off to the next hapless program or whatever it is you are doing with it.
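A minimal sketch of that check in Python, under the assumption that the whole XML part is available as bytes just before the save. The function name `write_xml_checked` is made up. It only tests well-formedness (validating against a schema would need a third-party library), and it writes via a temporary file so that a failure halfway through the write does not clobber the last good copy.

```python
import os
import tempfile
import xml.parsers.expat


def write_xml_checked(path, data: bytes):
    """Write data to path only if it parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        raise ValueError(f"refusing to save malformed XML: {err}") from err

    # Write to a temporary file in the same directory, then swap it into place,
    # so a truncated write never replaces the existing file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(temp_path, path)
    except BaseException:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

An application could catch the `ValueError` and warn the user instead of silently saving, which is the behaviour being asked for here.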

        It's this sort of basic laziness - failure to sanity check your inputs/outputs even when it's trivial to do so - that is an endless source of annoying and avoidable bugs. I'm not just singling out LibreOffice - I work for a company that writes an awful lot of software, and much more time than I would like is spent dealing with this sort of issue. (I can understand why it happens though - if sanity checking and general attention to detail doesn't win you many prizes in the feature-based development cycles, why bother...)

        I'm not trying to defend MS either, or their formats - but whatever you say about them, an XML doc is an XML doc.

    4. Anonymous Coward
      Anonymous Coward

      Re: Which Office product is at fault?

      Trevor, from the 1st para it feels like you're blaming MS Office, where in fact LibreOffice is the culprit behind the original issue

      Actually, the culprit is IMHO MS for creating that horrific excuse of a format. Over the years I have seen Word fall over in a number of ways (typically as a document grows), partly because of that incredibly stupid idea of carrying formatting across in cut & paste. The usual fix for that was to load it up in LibreOffice, clear it up and export it, after which all would be well.

      In my experience, LibreOffice does simply not play well with MSOOXML formats (hey, there's a surprise), and we have thus decided to avoid MSOOXML format altogether. The result is that things just work, and if a client really wants that format, we have a copy somewhere on a box that will convert. But internally we won't touch it - we cannot afford the time it takes to fix the problems it causes.

      1. Robert Harrison

        Re: Which Office product is at fault?

        "...Actually, the culprit is IMHO MS for creating that horrific excuse of a format..."

        So it's the format's fault the code didn't work eh? :-)

        (Not that I have any particular love for OOXML myself)

    5. Anonymous Coward
      Anonymous Coward

      Re: Which Office product is at fault?

      "I can't argue with LibreOffice's price tag :-)"

      If your time is of no value....

      1. Trevor_Pott Gold badge

        Re: Which Office product is at fault?

        Detail exactly what about Libre Office requires more time input than Microsoft Office? Please also account for the time that must be spent working out the licensing required to appropriately license Microsoft Office (both in VDI and non-VDI environments, where multiple devices are involved, where more than 5 devices are involved and where devices are used both at home and in an office environment.) Additionally, factor in the time required to manage said licensing as well as the time required to generate the income to pay said licensing.

        ...or are you just completely full of shit?

        1. jzlondon

          Re: Which Office product is at fault?

          > "Detail exactly what about Libre Office requires more time input than Microsoft Office? "

          Staff training and support?

          Macro conversion?

          Niggling formatting issues when sharing documents with MS Office users?

          Don't get me wrong. I love that Libre Office exists, but that's not inconsistent with recognising that it's not automatically cheaper or better than MS Office. Separate concepts.

          1. Trevor_Pott Gold badge

            Re: Which Office product is at fault?

            It takes more training to move from Office 2003 to "fucking useless ribbon bar" than it does to LibreOffice.

            No macros here, or on any of my client sites.

            File --> Print --> PDF Printer.

    6. Anonymous Coward
      Anonymous Coward

      Re: Which Office product is at fault?

      Also Word is not "dying" at all - which would imply a crash or something alike. It simply points out politely it can't open the document - why and where the error is - and exits.

      It is correct to stop loading a document you can't process properly and you could corrupt more.

      1. Trevor_Pott Gold badge

        Re: Which Office product is at fault?

        "It is correct to stop loading a document you can't process properly and you could corrupt more."

        No it's correct to open it in read-only with only the option to save to a different file name...so you can get what you can from it.

  10. Anonymous Coward
    Anonymous Coward

    For the sake of completeness...

    Could you add a paragraph where you compare your wife with Maven so we can get properly outraged?

    1. Trevor_Pott Gold badge

      Re: For the sake of completeness...

      I don't know who Maven is. I can compare her to my bearded dragon, however. Both are passed out on the chair absorbing sunlight. I think they're plants or something. Require sunlight to recharge...

      1. Anonymous Coward
        Anonymous Coward

        Re: For the sake of completeness...

        I can compare her to my bearded dragon, however

        OK, you have now officially wandered well across the world of euphemisms into a place I don't want to be near when lightning strikes. Nice to have known you :)

        :)

  11. nascentmc

    Just a simple thanks. I know that will get me out of a hole one day.

  12. Anonymous Coward
    Anonymous Coward

    Did you try the obvious?????

    http://windows.microsoft.com/en-gb/windows/previous-versions-files-faq#1TC=windows-7

    1. Trevor_Pott Gold badge

      Re: Did you try the obvious?????

      Yes. You do realize that Previous Versions only triggers on a pre-set schedule or with Windows Backup, hmm? We save to Sync.com, which contains a version history of our files. The issue is that the error crept in around page 2, but we didn't realize it until page 32. Even had we gone back to the "last known good" version of the file, we would have lost 30 pages - the better part of three weeks' - worth of work.

      That was explained in the article in detail. Neither Previous Versions nor Sync nor Dropbox nor any other version control mechanism could have prevented this.

      1. Stevie

        Re: Did you try the obvious?????

        "Even had we gone back to the "last known good" version of the file, we would have lost 30 pages - the better part of three weeks' - worth of work."

        I don't understand (and want to understand) how this document could be worked on for 30 days in OfficeLibre Write, have a problem in page 2 - by inference from the quote, authored some 30 days before - and have:

        a) The document displaying these pages when the article says the text truncates at the error (to paraphrase what was written)

        b) Only one version backup

        Is it the case that the document was open and unsaved for the 30 days in question?

        1. Trevor_Pott Gold badge

          Re: Did you try the obvious?????

          The document was open and unclosed for 30 days. It was saved repeatedly. Thus there were many versions of it in the version history; however, the error would not be noticed until the document was closed and reopened.

          In other words: the word processor - be it LibreOffice, Word, or any other that is capable of saving files with improper XML - is perfectly capable of allowing you to continue editing the document after creating the XML error, so long as you don't close and reopen the document.

          This means you get as many versions as you want just by mashing the "save" icon on a regular basis, and/or using autosave...but they all will contain the error from the moment the error was introduced. Which, in this case, was around page 2.

          Thus backups don't help. You have:

          1) Initial document created.

          2) Five saves until error is created on page two.

          3) 600+ saves after that, all of which have [good document] + [XML error] + [more good document].

          As long as it contains [XML error] it won't open.
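The save history above has a useful property: once the error appears, every later save carries it. Under that assumption, and assuming each saved version's document.xml has been extracted into a list ordered oldest first, a binary search finds the save where things went wrong in a handful of checks instead of 600. This is only a sketch with hypothetical function names. It tells you when the error crept in, not how to fix it.

```python
import xml.parsers.expat


def is_well_formed(data: bytes) -> bool:
    """True if data parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def first_broken_version(versions):
    """Index of the first malformed version, or None if all are fine.

    Assumes versions are ordered oldest first and that every version
    after the first broken one is broken too.
    """
    low, high = 0, len(versions)
    while low < high:
        mid = (low + high) // 2
        if is_well_formed(versions[mid]):
            low = mid + 1  # the break happened later
        else:
            high = mid  # this one or an earlier one is the first broken save
    return low if low < len(versions) else None
```

Diffing the first broken version against the one just before it should then narrow the damage down to whatever changed between those two saves.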

          1. Stevie

            Re: Did you try the obvious?????

            "The document was open and unclosed for 30 days"

            Ah. I went back through the comments and located another where you discussed this, which I hadn't seen.

            Well, keeping documents open for long periods has been recognized as a source of problems since Og Higgins stormed into Thrug Eisenberg's cave (stepping on the office pterodactyl's tail in his rage) and demanded his desktop icons be restored in 12 billion BC or thereabouts (dates are disputed, mostly by people who know computer history, but the bad-ideaness precedes the advent of DOCX by a decade or more for sure).

            Personally, I'd suspect the virtualization as playing a part in this sad tale if only because the umptytump ways a memory location gets remapped to another in such systems begs for the sorts of "lost bytes hiccup" issue this seems at heart to be.

            Though I know from personal aggravation that earlier versions of OpenOffice Calc had a disturbing way of losing track of memory in a laptop where it was the only application running. Formulae would stop working and mysterious table lookup errors and function syntax errors would be thrown. Some would clear if the application was closed and re-opened, some would only clear if the cells involved were cut and pasted into different places, then moved back again. What did I expect; it was free (tm) and I was Pushing It.

            Again, kudos for the fix.

            1. Trevor_Pott Gold badge

              Re: Did you try the obvious?????

              I can say absolutely that virtualisation wasn't an issue here. 1) Virtualisation doesn't work that way. 2) VMs are all granted static blocks of RAM in my config. 3) Looking at the raw XML it became pretty clear that she had attempted to create a hyperlink, thought better of it and cancelled. (Asking her confirmed this.)

              Which meant that LibreOffice created the <Hyperlink> tag but didn't erase it when the cancel was hit. It is likely a specific error related to trying to add a hyperlink to text that had been formatted in the way she was formatting it (blue + italics + something else.)

              It absolutely isn't a glitch in the matrix, issue of the OS, hypervisor, etc. It's a bug in the code. Just like that oMath bug in Word, or the many similar ones in Excel. Some aspect of the application made a commit to the XML structure when it shouldn't have, expecting that you would then follow through with something, or that if you pressed cancel it would "un-commit" the changes made. It's a bad design choice...but apparently a reasonably common one.

              If anything, this incident should be used as praise for XML-based file formats. OOXML is something that can be edited in this fashion. .DOCs could not. That's a step forward.

              1. Stevie

                Re: Did you try the obvious?????

                I agree. Not much you can do with a malformed DOC but try and Save As RTF in the hope it is a formatting string at the root of the problem, and that throwing that away will induce legibility once more.

  13. Anonymous Coward
    Thumb Up

    What @nascentmc said, thanks, this is a good article.

  14. Terry 6 Silver badge

    Open documents

    Actually, the idea of leaving a working document open when not in use leaves me struck with a terrible feeling of dread. Maybe it's paranoia, but I can't help the feeling that any document left open when I'm absent/asleep/watching TV will somehow change itself, vanish into the ether taking the saved version with it or randomly freeze.

    1. Trevor_Pott Gold badge

      Re: Open documents

      First time it's happened to either me or my wife in a decade. Seriously, we use persistent VDI for everything. The server's rock solid. We can just disconnect our RDP sessions and walk away. Documents are versioned to Sync.com. Who would've thought something like this could happen, eh?

    2. Roland6 Silver badge

      Re: Open documents

      Many people in IT who had experience of early versions of Windows (remember the celebrations when Windows was able to run for 30 days without falling over or requiring a reboot?) would also be wary of having something open for a long period of time, but it is something I expect we will see more of.

      With Word et al doing auto-saves every so often (and Windows requiring fewer reboots) and devices that are effectively always on, many normal people will walk away from their systems, leaving things running, expecting to return and resume where they left off. I expect the sort of problem Trevor describes to become a more frequent occurrence.

  15. sisk

    Been there

    I've done this before too, more than once. I've never quite figured out how Word or LibreOffice can manage to save an invalid XML file (I've seen them both do it), but it seems to happen about twice a year around here. Which when you consider that we have somewhere around 900 users saving Word documents several times a day isn't that bad I suppose.

  16. Stevie

    Bah!

    Hmm. Good work.

    But why go to so much trouble trying to pin this (in the reader's mind's eye) on MS - by the headline which strongly suggests the problem will lie with yet another DOCX issue and by "the most famous example" which is still pretty obscure to be honest - when what you are really up against is an OfficeLibre Write bug?

    10 points for style but minus a couple of hundred for mendacity.

    1. Trevor_Pott Gold badge

      Re: Bah!

      Word does it too. This particular example involved LibreOffice, but after research I'm pretty sure every single word processor out there has at least one variant of this bug.

      1. Stevie

        Re: Bah!

        "Word does it too" (1)

        Does what? There doesn't seem to be much in the way of analysis of how the document got malformed. I understood from reading the article the clever way you went about fixing the error (and Kudos for the many helpful suggestions) but the sequence of events that caused the error are not discussed at all, so one can't say more than Word has been known to malform an XML document - which you'd already said in the article. There is no way I can, from what you wrote, avoid the problem when using OfficeLibre Write since I don't know what the actual problem is.

        "Word does it too" (2)

        *shrugs* My point is that this story is about an OfficeLibre Write bug, NOT a MS Word bug, which the wording of the Headline and the tone of the article wants the reader to take away - in my opinion (and as a reader, my opinion is the one that counts since it indicates your message is not going over as you intended).

        The only time you mention the culprit software, you then go on to exonerate it partly by saying how it will at least open the document (while the subtext that MS Word, by not opening a malformed document written by a non-MS product is somehow a problem, is front and center).

        Don't get me wrong: I use OpenOffice products myself (though because I have some very sophisticated Calc stuff I use a lot I am not remotely spurred by PC politics to chance an uckfup in my formulae by switching to the more socially-redeemable OfficeLibre at this stage thank you very much). But this sort of backhanded evangelism is a bit cringe-inducing and mendacious.

        DOCX disaster recovery: How I rescued my wife from XM-HELL wished on her by OfficeLibre Write.

        Fixed it.

        1. Trevor_Pott Gold badge

          Re: Bah!

          Word does it too. http://support.microsoft.com/kb/2528942/en-us

          There. It's a link. It was in the article. It was in this thread. It describes an issue where Word causes exactly this kind of error, with a different object.

          This is not the only error where Word fails to write XML into OOXML formats appropriately. I am aware of at least three others.

          Thus the article is not about either Word or LibreOffice. It's about the CLASS OF ERROR. Any word processor can create it. It isn't about "bashing" any particular word processor or "exonerating" any one. They all do it, in subtly different ways.

          1. Stevie

            Re: Bah!

            Apologies. Apparently I have my own firewall nanny-induced malformed markup issues to contend with. Did not see the link in the main article (but can see the non-clickable text in the comments - bizarre).

        2. This post has been deleted by its author

          1. Trevor_Pott Gold badge

            Re: Bah!

            The oMath issue is simply the most famous (infamous) example of this. Also: lots of organizations have unpatched Office installs. I hope this encourages some folks to patch. AFAIK, Office 2003 and 2007 require hotfixes, not patches. There are still similar bugs in Excel, and for all I know the specific issue in LibreOffice has been patched by now. (She was using an older version, after all.)

            This has nothing to do with which productivity suite you use. It is about how the error can occur in the format. Any application can cause it. Now you know about the error and know what to look for if/when it hits you.

            If you have a religious issue with Open Source software - or Microsoft, or whomever - take it elsewhere.

          2. Paul Crawford Silver badge

            Re: Bah!

            I have seen Word fsck-up on embedded equations and occasionally on embedded images on EVERY version from 95 to a fully-patched (as of a few months ago) version of Word 2010, that is 15 years of at least one unfixed bug!

            Also seen crap from OpenOffice/LibreOffice.

  17. Anonymous Coward
    Anonymous Coward

    The wife was using an old version of LibreOffice Writer (v4.1)...

    That's your problem right there. The entire suite (and I include OO as well as LO) is a fuck-up.

    1. Trevor_Pott Gold badge

      Re: The wife was using an old version of LibreOffice Writer (v4.1)...

      Really? Then why do I get more bizarre errors with Microsoft Office than anything else I use? What office package do you recommend that is absolutely reliable? This is the first error I've seen in Libre/Open Office in over a decade. I've seen a hell of a lot more out of Microsoft Office in that time...

  18. Maty

    sledgehammer meet nut

    Apart from the fact that a growing number of computers don't run anything capable of handling Microsoft's document-flavour-of-the-week, in the vast majority of cases, Word adds unnecessary complexity to basic text.

    I'd guess 95% of all docs written in this overblown monster of a program can and should be written on wordpad. The number of projects actually needing Word are about the same as those needing specialist graphic or CAD.

    I find HTML-Kit292 best for examining docx(tm) and extracting information (such as my next dentist appointment). Open Office handles all the other perversions of the document format that Microsoft has invented over the years.

    1. bob, mon!
      Holmes

      Re: sledgehammer meet nut

      For most of the word content I see, notepad would've been good enough.

    2. Stevie

      Re: sledgehammer meet nut

      "I'd guess 95% of all docs written in this overblown monster of a program can and should be written on wordpad. The number of projects actually needing Word are about the same as those needing specialist graphic or CAD."

      You shouldn't project your own use of the software on the public.

      Just because the IT world needs only a typewriter (and gets one for free with every copy of Windows sold) doesn't mean that "95%" of the world sees it that way.

      Legal offices, Doctor's offices, hell, just about any professional office would need more from their software than wordpad can produce - which is why Wordperfect was written.

      Not everyone uses Word for bouncing memos around and submitting this week's specs to the boss.

  19. Vince

    "What do you do when a critical Word document won’t open?"

    Restore from the backup. Job done.

    Although given your more-effort-than-needed approach to the anti-spam setup recently, I'm sure you've used the worst possible concept of a backup too.

    If "restore from the backup" is not a valid option I suggest you should retire.

    1. Trevor_Pott Gold badge

      Okay, smartass, how does restore from backup save you in this situation? If you are so cosmically empowered with systems administration skills, explain to me exactly how it resolves the issue.

      My "backups" in this case consisted of a complete version history right back to version one of the document, care of Sync.com. Windows Previous Versions had a copy taken twice a day, every day, since version 1. The auto-rarchiver had every fifth or sixth version of the document as it landed on the NAS. None of that helped.

      The "last known good" copy of the document had only two pages of text. Thirty pages of text - about three weeks' worth of work - had been written to the document after the error was originally introduced. Using the backups would have wiped out thirty pages worth of work.

      So how, exactly, do "backups" help with that?

      Given the lack of reading comprehension you display attempting to read a simple article, I am convinced you are unable to read (and remain focused on) a technical document of any length. That tells me that it is you who should be retiring. Preferably to some place where they keep the electrical sockets covered and only let you use dull sporks to eat.

      Seriously, is "restore from backups, job done" the level of service you give your customers? Your own wife? "No thinking about the situation, simply push the button and don't think about it." Not only must you be one of the worst sysadmins to employ but your sex life must be akin to trying to fuck a Dalek!

      1. No. Really!?
        Pint

        All Worthwhile

        The rate of reading comprehension on this article amazes me, and provoked you to the above post.

        However, for me it was all worthwhile for that last line.

        So much laughter here.

        Ta, Trevor!

    2. This post has been deleted by its author

  20. theOtherJT Silver badge

    ...and this is why I write basically everything in either notepad or nano depending on the machine, save it as raw text and only open it in something more complicated at the very end when it becomes time to pretty up the formatting.

    1. Primus Secundus Tertius

      @theotherjt

      Maybe I live dangerously, but I find One Note very good for concocting the first draft of a document.

      Mind you, there are no styles or themes in the Word documents it produces: everything is strictly low level font face and font size.

    2. Intractable Potsherd

      @theOtherJT

      I've been thinking this for a while, but never got around to implementing it. It would be good for me because I tend to mess around formatting as I go along, instead of just getting the text down as quickly as possible. This nasty little error, which I haven't experienced (yet??) has just about persuaded me to change.

  21. Will Godfrey Silver badge
    Unhappy

    Something very wrong here.

    Many moons ago when BBC BASIC was all the rage, there were a number of programs written that could recover a badly corrupted BASIC file (I even wrote one myself). When they found a corrupted part they simply stepped forward one byte at a time until they found a line that scanned correctly. The really smart ones (like mine!) would then collect up the garbage sections and rewrite them as DATA, so even that had a hope of then being manually restored.

    So, the question is, why is nobody doing the same today? There is a huge amount of XML out there. It's already in text form so it should be easier to sort out than tokenised programs.

    1. Trevor_Pott Gold badge

      Re: Something very wrong here.

      Hell if I know. I would have thought this type of application would be pretty basic, but nothing I used could do it. The XML parser in Notepad++'s plugin store just said "error on line 2". Helpful. Visual Studio was equally useless. None of the "we'll fix your broken Word document" applications were worth a damn either.

      It's really odd because if someone had written the recovery app exactly as you suggested it would have worked just fine. Market opportunity for you?

    2. Al Jones

      Re: Something very wrong here.

      If this error was as common as corrupt BBC Basic programs (and, to be fair, corrupt Speccie, C64 and Amstrad CPC464 programs), then no doubt there'd be a number of tools out there to fix it.

      But it's not that common a problem. Without taking away from the time and effort Trevor put into figuring out how to actually open up the file and fix it, the process is relatively straightforward, and it just needs an XML parser with a slightly different focus (highlighting the hanging tag) to make it easier. The problem simply doesn't crop up often enough for the people with the necessary skills to have been motivated to write code that makes it easier to fix when it does crop up.

      I kinda knew somewhere in the dusty recesses that an OOXML file was just a .zip file, and Trevor's description of the process that he used will probably mean that I'll remember that fact if I'm ever motivated enough to try to recover data from a corrupt OOXML file, but I'll probably end up trying whatever XML parsers I have to hand at that time, rather than remembering which ones worked best for him.
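
      For anyone who lands here with only Python to hand, the "open the zip, find the broken XML" step can be sketched with the standard library alone. This is a rough illustration rather than the method used in the article: the function name `find_xml_errors` is made up for the example, and it only reports where the parser gave up, which (as noted elsewhere in the thread) is not always where the real damage is.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_xml_errors(docx_file):
    """Return {part_name: (line, column, message)} for each XML part that fails to parse.

    docx_file may be a path or a file-like object; an OOXML file is a plain zip archive.
    """
    errors = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # The main body normally lives in word/document.xml, but check every XML part.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(archive.read(name))
            except ET.ParseError as exc:
                line, column = exc.position  # where the parser stopped
                errors[name] = (line, column, str(exc))
    return errors
```

      Calling `find_xml_errors("broken.docx")` (a hypothetical file name) returns an empty dict when every part is well-formed, and otherwise the line and column to start looking at for each damaged part.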

    3. Kanhef

      Re: Something very wrong here.

      It doesn't seem like this would be particularly hard to do; you could probably borrow a lot of the code from web browsers, which already do a fairly good job of handling malformatted HTML. There are three main types of XML error that I can see:

      Orphaned tags with no matching closing or opening tag, which is what Trevor's problem seems to have been. Easy enough to delete or escape as text.

      Transposed tags, such as < a ...>< p >< /a >...< /p >. This would take a bit more work to detect, but the fix is obvious.

      Broken tags, particularly missing right angle brackets. Escape the left bracket and recheck the document, as this will probably create an orphaned closing tag.
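
      The first case (orphaned tags) can be sketched with a simple stack scan. This is a minimal Python illustration under loud assumptions: the regular expression is a simplification that ignores comments, CDATA and attribute values containing angle brackets, and the name `find_orphans` is invented for the example.

```python
import re

# Matches opening, closing and self-closing tags. Deliberately simplistic:
# comments, CDATA, processing instructions and '>' inside attribute values are not handled.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


def find_orphans(xml_text):
    """Return (orphan_openers, orphan_closers), each a list of (tag_name, offset)."""
    stack = []           # tags currently open, as (name, offset)
    orphan_openers = []
    orphan_closers = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue
        if not closing:
            stack.append((name, match.start()))
            continue
        open_names = [open_name for open_name, _ in stack]
        if name in open_names:
            # Close the nearest matching opener; anything opened after it never got closed.
            index = len(open_names) - 1 - open_names[::-1].index(name)
            orphan_openers.extend(stack[index + 1:])
            del stack[index:]
        else:
            orphan_closers.append((name, match.start()))
    orphan_openers.extend(stack)  # still open at end of input
    return orphan_openers, orphan_closers
```

      Each reported offset is a character position in the text, so it can be used to jump straight to the suspect tag in an editor. A transposed pair shows up as one orphaned opener plus one orphaned closer of the same name.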

      1. Vic

        Re: Something very wrong here.

        you could probably borrow a lot of the code from web browsers, which already do a fairly good job of handling malformatted HTML.

        For cleaning up malformed HTML, there's one product without peer IME - BeautifulSoup

        I've no idea how it copes with XML...

        Vic.

      2. Michael Wojcik Silver badge

        Re: Something very wrong here.

        Transposed tags, such as < a ...>< p >< /a >...< /p >. This would take a bit more work to detect, but the fix is obvious.

        Detecting that situation, and determining how to fix it, can easily be done with dynamic programming; it's basically a variation of the Minimum Edit Distance algorithm (which is also the basis for the most common diff algorithm).
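
        For reference, the plain Minimum Edit Distance that such a tool would build on can be sketched in Python as follows. This is the textbook dynamic-programming table applied to sequences of tag tokens, not the tag-repair variation itself; the function name `edit_distance` and the token lists are illustrative assumptions.

```python
def edit_distance(source, target):
    """Classic dynamic-programming edit distance between two token sequences."""
    # previous[j] is the cost of turning the source tokens seen so far (before this row)
    # into the first j target tokens.
    previous = list(range(len(target) + 1))
    for i, src_token in enumerate(source, start=1):
        current = [i]  # turning i source tokens into nothing costs i deletions
        for j, tgt_token in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_token != tgt_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


transposed = ["<a>", "<p>", "</a>", "</p>"]
nested = ["<a>", "<p>", "</p>", "</a>"]
distance = edit_distance(transposed, nested)
```

        A repair tool would compare the observed tag sequence against well-nested candidates and prefer the candidate with the smallest distance; here the transposed pair is two single-token edits away from the properly nested form.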

        Broken tags, particularly missing right angle brackets. Escape the left bracket and recheck the document, as this will probably create an orphaned closing tag.

        A missing right angle bracket (right chevron, greater-than sign[1]) either should terminate the last tag of the document, or should terminate a tag that's not the last one. So either you'll encounter a left angle bracket (in a non-quoted context[2]), or you'll hit end-of-input, while looking for the right angle bracket. In either case, the erroneous tag is the one begun by the preceding left angle bracket.

        That being the case, the parser can simply refrain from pushing the malformed tag onto the stack[3], and it will hit either the closing tag or end-of-input in due course. There's no need to explicitly escape the left bracket.

        This also handles the case where an unescaped less-than sign was written to the XML document. Unescaped greater-than signs are trivially errors if they appear in PCDATA; if they appear within a tag, the actual right bracket for that tag will produce an error, so the user knows one of the two is wrong, and if there's a DTD or schema available the situation's rarely ambiguous, if the tool wants to be a bit more ambitious.

        Malformed entities are almost always trivial to detect (I can't offhand think of a case where they aren't). Malformed CDATA sections are tougher, particularly if the closing CDATA sequence is lost.

        [1] In traditional typography those aren't the same character, of course, but in XML and many other SGML applications the ASCII greater-than sign is interpreted as a right angle bracket or right chevron.

        [2] That is, outside CDATA. In other XML quoting contexts, notably attribute values, angle brackets must be escaped using character entities.

        [3] Conceptually - there may not be a "stack" in the classic data-structure sense, but it acts like a stack however it's implemented. XML can be parsed with a PDA.
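
        The detection rule described above (a left angle bracket or end-of-input arrives before the expected right angle bracket) can be sketched in a few lines of Python. This is a simplified illustration: it treats every character the same way, so it does not honour CDATA sections or other quoting contexts, and the name `find_unterminated_tags` is an assumption made for the example.

```python
def find_unterminated_tags(xml_text):
    """Return the offset of every '<' not closed by '>' before the next '<' or end of input."""
    broken = []
    open_at = None  # offset of the '<' we are currently inside, if any
    for offset, char in enumerate(xml_text):
        if char == "<":
            if open_at is not None:
                # A new tag started before the previous one ended:
                # the erroneous tag is the one begun by the preceding '<'.
                broken.append(open_at)
            open_at = offset
        elif char == ">":
            open_at = None
    if open_at is not None:
        broken.append(open_at)  # hit end-of-input while still inside a tag
    return broken
```

        A parser built this way would simply skip the tags at the reported offsets rather than pushing them, and carry on.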

  22. Paratrooping Parrot
    Paris Hilton

    I am still on Word 2003, because I hate the ribbon thingy. Does that have the same problem when using Doc format?

    1. Trevor_Pott Gold badge

      Actually, it's worse. .doc files are binary blobs, not XML. If they had a similar weird error writing metadata to the file it would simply be corrupt and unrecoverable. That said, you can get around this in two ways:

      1) Configure Office 2003 to save in DOCX/XLSX/etc.

      2) Use Word 2010 and UBitMenu to get the menu back. Delete all tabs that aren't "menu" from your config. Suddenly, it looks just like Office 2003!

      1. Roland6 Silver badge

        re: Does Word 2003 have the same problem when using Doc format?

        Yes I've encountered similar errors with Word using the Doc format.

        A lot depends upon what happens when you try to close the corrupt document in Word. From memory, if you use "Save As" (i.e. create a new copy), Word will discover the corruption and tell you that it is unable to save. I've worked around this by saving to a different format (e.g. RTF) or even by cut-and-paste into a new document, which will also error when you try to cut-and-paste the segment of the document containing the malformed element.

        Things get really difficult when you simply click "Save" and close Word, as then you no longer have an open version of the corrupt document.

        Like Trevor, for many problems, I've found re-opening a Word document in OpenOffice/LibreOffice and saving it as a new file sorts things out. But there is a group of problems for which this doesn't work.

    2. Paul Crawford Silver badge
      Unhappy

      I have had errors on trying to save in Word saying the document was too big to save - think some corrupted embedded objects were reporting '-1' as the size so 4GB or something.

      Sadly the only option was to delete said object, save, start again and re-embed it. I just hate Word...but it is probably the least-sucking word processor :(

  23. Fuzz

    change file extension

    Why would you change the file extension when your preferred zip client is 7zip? just right click, 7zip, open archive.

    1. Trevor_Pott Gold badge

      Re: change file extension

      Because not everyone chooses 7Zip and I was trying to write it targeted at everyone. I don't trust the native Windows Zip client to repack the document properly, so I wanted to subtly hint that getting a proper Zip client would be a good plan, however, if they chose a different client I wanted to have some instructions in the article for how they could go about getting into the file.

  24. Clive Galway

    Simpler way to fix DOCX files...

    Open it in Open Office and save.

    1. Trevor_Pott Gold badge

      Re: Simpler way to fix DOCX files...

      OpenOffice didn't read it either. Had the same issues as LibreOffice.

  25. Don Jefe
    Thumb Up

    Unexpected Compliment Error

    I just fucking hate low level technically oriented articles. I would rather read the instructions on a box of tampons at least those have stick figure illustrations for entertainment.

    But not this time. This has to be one of the most well written technical type articles I've ever read on El Reg, anywhere really. Just wanted to thank you for the read. I enjoyed it.

    I've also integrated part of it into my troubleshooting protocol. My default response to any and all technical questions in the future will be 'The problem seems to begin at Line: 2, Column: 12464. Start there and get back to me with what you find there'. Ideally the question will be about a machine with no computer involved so when they come back and say they can't find Line: 2, Column: 12464, then I can make them go away by saying 'Until we know what's at Line: 2, Column: 12464 we can't do anything. Keep looking and don't come back until you find it. I'm a busy man'.

  26. Anonymous Coward
    Anonymous Coward

    I had once...

    ..a Word .doc file that could only be saved if you closed and Word asked (do you want to save? Y/N) because if you saved first it would crash.

    Even copying and pasting in a new document would cause the same behaviour. The only solution was to copy the raw text into notepad (therefore cleaning all the formatting junk) and then pasting the clean TXT into a new, unformatted .DOC.

    Up to this day I run into some office files with weird idiosyncrasies like that; they take forever to open if you click straight on them, but if you open Word prior without a file, they open in 1 second. Or tables inside Word that "are too complex to be written" or something like it.

  27. FrankAlphaXII
    Thumb Up

    With having to write papers all the time for both of my jobs, as well as for my master's degree, this is a great set of tricks. I used to just chuck the file and write it again when this would happen every now and then. Thanks Trevor!

  28. Anonymous Coward
    Anonymous Coward

    Simple solution I've always used

    If I change a document, I save it as document_06092014 (reverse the 06 and 09 if you're not American ;))

    We have the ability to store terabytes of data, so "wasting" it with a bunch of old versions is cheap insurance for problems like this. It takes about 4 seconds extra when saving, so one time avoided going through the TL;DR process in the article more than makes up for it.

    1. Trevor_Pott Gold badge

      Re: Simple solution I've always used

      As per eleventeen squillion discussions about versioning in this thread; your solution wouldn't have helped.

      1. Anonymous Coward
        Anonymous Coward

        Re: Simple solution I've always used

        Why not? That way you lose only those changes you made in the last update - it was able to load to make those changes, so it'll load again (after possibly downgrading the software if a more recent update has a bug that prevents loading of your previously OK n-1 saved version)

        If you spend hours making the changes, that's a bad thing, but better to lose hours of updates than months or years worth.

        1. Trevor_Pott Gold badge

          Re: Simple solution I've always used

          Since actually reading the article or any of the other comments in this thread is too tedious for you, I'll repost a previous comment here.

          The document was open and unclosed for 30 days. It was saved repeatedly. Thus there were many versions of it in the version control document, however, the error would not be noticed until the document was closed and reopened.

          In other words: the word processor - be it LibreOffice, Word, or any other that is capable of saving files with improper XML - is perfectly capable of allowing you to continue editing the document after creating the XML error, so long as you don't close and reopen the document.

          This means you get as many versions as you want just by mashing the "save" icon on a regular basis, and/or using autosave...but they all will contain the error from the moment the error was introduced. Which, in this case, was around page 2.

          Thus backups don't help. You have:

          1) Initial document created.

          2) Five saves until error is created on page two.

          3) 600+ saves after that, all of which have [good document] + [XML error] + [more good document].

          As long as it contains [XML error] it won't open.

          Had your advice been followed we'd be back to "two pages of document" instead of "32 pages of document." Interestingly enough, those exact same two pages of document were what Writer could read before it encountered the XML error.

          In other words, your advice doesn't prevent the problem and is actually useless because of how LibreOffice treats XML-flawed documents.

          What's more, as was explicitly stated several times in both the article and various posts here in the comments, we did have a versioning system in place. We just don't have to do it manually, like primitives scratching on stone tablets. Applications can autosave now. And you can mash the save button. Any change you make gets sent to Sync.com/Dropbox/Livedrive/etc that then versions it for you.

          In fact, the wife actually periodically WOULD save the document to the desktop, to Dropbox, to the local NAS, etc...all without the requirement to close the application.

          Cheers.

          1. Anonymous Coward
            Anonymous Coward

            Re: Simple solution I've always used

            OK, well I apologize for not having read all the comments and understanding that the file was open for 30 days.

            I think the solution to THAT problem is obvious now, but I admit I wouldn't really have expected that to be a problem, either.

          2. cyberelf

            Re: Simple solution I've always used

            "as was explicitly stated several times in both the article and various posts here in the comments, we did have a versioning system in place" .. "So long as you never close the word processor you’ll just keep saving corrupted versions of the file with more and more data after the corruption point."

            A versioning system that repeatedly saves a corrupt file without telling you isn't of much use now, is it?

            1. Trevor_Pott Gold badge

              Re: Simple solution I've always used

              And your proposed solution is?

    2. Charlie Clark Silver badge

      Re: Simple solution I've always used

      document_06092014

      Always use ISO (YYYY-MM-DD) if you want anyone else to be able to make sense. Sorts in the right order as well.
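
      If you do want date-stamped copies, the stamping is easy to hand to the machine. A small Python sketch, assuming a helper called `stamped_copy` (invented here) and using the ISO format so the copies sort chronologically:

```python
import shutil
from datetime import date
from pathlib import Path


def stamped_copy(path, today=None):
    """Copy path to a sibling file named <stem>_<YYYY-MM-DD><suffix> and return the new path."""
    source = Path(path)
    stamp = (today or date.today()).isoformat()  # e.g. 2014-06-09
    target = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves timestamps where it can
    return target
```

      Note that this only automates the naming; as discussed above, a copy of a file that already contains the XML error is still a broken copy.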

    3. Michael Wojcik Silver badge

      Re: Simple solution I've always used

      If I change a document, I save it as document_06092014

      Because why have a computer take care of repetitive, easily-automated tasks, when we can make the user do them instead?

      If there's any advantage to the user in saving every version of a document under a different name, then the software is broken.

      Now, in this case, the software is broken. The folders-and-files, explicit-save, overwrite-the-previous-document user interaction model is fundamentally, mind-numbingly wrong. (Dave Platt points to some of the ways it's wrong in his classic Why Software Sucks, but he doesn't go far enough.) Some aspects of it are holdovers from days when resource constraints severely limited how users could manage their data; others are just the result of intellectual laziness.

      The Sugar OS (the userland for the OLPC laptop) made a couple of tentative steps in a better direction, saving all work automatically and listing it in the "journal", and there have been other attempts in other fringe OSes. But the main desktop OSes are still endorsing this abysmal approach to handling content.

      Instituting awkward manual protocols that impede frequent saving (do you save your documents only once a day?) and proliferate user-visible metadata (which may not match system metadata, and then how do you interpret that skew?) seems like rather a case of, at best, one step forward and one-minus-epsilon steps back.

      1. Anonymous Coward
        Anonymous Coward

        Re: Simple solution I've always used

        Michael, I agree if computers were perfect I wouldn't need to do that. Wouldn't need backups either, as they'd use AI to know what stuff I delete I'm going to want again, along with finding free cloud backup services while I sleep and updating my files to the cloud the second they change.

        I look at it as cheap insurance. If it took 10 minutes to do I wouldn't bother, but it is quick and painless, and provides a simpler way to look at an older version of a document. If I wanted to use the built-in versioning I would have to look up how to do so. It would take a lot of 4 second manual version saves to add up to the minutes I'd take doing that, so I may come out ahead anyway :)

  29. Denarius
    Thumb Up

    just had this same problem

    job app update, as expected by Murphy, urgent and document won't open after update for open and close check. Did similar to Trevor initially, then went back to the time-stamped version from two days before and redid the edits. LibreOffice was used. Much appreciated the hints for next time, Trevor.

  30. herman

    Next time, maybe try 'tidy'?

    DESCRIPTION

    Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML. For generic XML files, Tidy is limited to correcting basic well-formedness errors and pretty printing.

    1. Trevor_Pott Gold badge

      Tried it. It's integrated into a plugin for Notepad++. It refused to try to pretty the XML because of the XML error, and simply told me what line it was on. :( Got another IDE you use Tidy in that I can try?

  31. ProperDave
    Coat

    I must be the only person in the world that often composes documents in notepad, and only ever resorts to an Office suite when it's time to make it look pretty.

    I did that all the time at University back in the early noughties. Especially for things like my thesis - where Word would get so laggy on the poorly specced Compaq University machines as the document got larger. I still do it to this day - with notepad being so quick and easy to type in, and only resort to Word for the final draft.

    This article is brilliant though! Well done Trevor! I'm sure I'll be using the advice in future recoveries of family members' files. :(

    1. Denarius
      Happy

      @ProperDave

      nope. Use vi for much of my document drafting, unix or DOS version. It has no smarts so I trusted it to not lose or mangle things. Even a server crash left text in /var/preserve or accessible via vi -r filename.

      1. Vic

        Re: @ProperDave

        Use vi for much of my document drafting, unix or DOS version. It has no smarts

        Vi actually has a *lot* of smarts[1]. They're just not turned on by default...

        Vic.

        [1] A colleague of mine would show me many natty features of her preferred IDE. I'm not entirely sure if she was pissed off or secretly impressed that I could duplicate them in vi...

  32. a_yank_lurker

    Useful

    The religious arguments are getting old so back to the point of the article. If you have buggered XML, you have some decent options available to fix the problems. This is a very handy bit of knowledge that may allow someone to recover a borked document and look like a genius. ODF and MSOX formats are zipped XML files so this trick is useful for both types.

    I suspect the problem is the LO parser incorrectly guessed the XML. The real culprit is that the specifications for MSOX formats must be reverse engineered to some extent. This means there are some rough edges where the interpretation of the internal MS spec is faulty.

    1. Dan 55 Silver badge
      Joke

      Re: Useful

      Is there any other kind of XML apart from buggered XML?

  33. Havin_it
    Coffee/keyboard

    Yup, got this particular T-shirt

    In my case it was ODF (well, ODG if we must be precise) that an internal invoicing app was spitting out in a template stylee by copying some field values into the base XML and zipping it all up. Only problem, the values often contained illegal characters that made LO puke. Recognising the problem didn't take nearly as long as beating all the data into shape :(

    I did howl a bit at this though, Trevor:

    >There are two ways to cheat. The first is use Visual Studio or Visual Studio Express. [...]

    Just two ways, really? VS, really? This seems a bit of a case of "When all you have is a hammer, everything looks like a nail" to me. You had the line+col locus, you had a distinctive string you could search for; either of these would have gotten you there in Notepad, for heavens' sake. OK, you don't get the code-beautification that might help fix it, but you didn't actually mention whether VS does that bit either, with an invalid file; and in any case, anyone who's gotten this far might reasonably be expected to have the chops to spot what's wrong quite easily.

    Other than that, good article - always good to raise awareness that self-help is sometimes an option with these formats (and a salutary warning to restart your wordprocessor once in a while!)

    1. Trevor_Pott Gold badge

      Re: Yup, got this particular T-shirt

      "There are two ways to cheat. "

      Where is the word "just" in that sentence? What am I, the fucking oracle? I am supposed to know every possible means of "cheating" that exists? I did talk about trying to grok the XML manually, without making it pretty...but apparently that's inadequate?

      1. Havin_it

        Re: Yup, got this particular T-shirt

        No, fair play, on second reading the intent is perfectly clear. I think I took a break before that sub-head and got the sense of that snippet arseways. I was reading it as "cheating at navigating to the locus", which wasn't what you were saying at all. Apologies for the unwarranted rabies (rabidity?).

        Out of interest (and as I'm more likely to have it at hand than VS), does Chrome actually direct you to the location of the broken tag as well?

        1. Trevor_Pott Gold badge

          Re: Yup, got this particular T-shirt

          Nyet. Chrome comes up with a completely WTF location that has nothing at all to do with the tags. Not the point in the XML where the closing tags should be, not the beginning where the hyperlink tag starts. What chrome reports as a location is disconnected from reality and utterly baffling.

  34. Zombieman
    Meh

    I haven't kept up with OOXML (is it still called that?) so I don't know if Microsoft have managed to release a version that complies with the format properly yet - I know their own software wasn't compliant when Microsoft were "fast tracking" the standard (by some accounts it shouldn't have been).

    And a Microsoft only world isn't always safe, or even compatible with itself... Yellow or green from the default palette? Yeah the "new" format can't always distinguish (yes I've seen colours change just closing and opening files)... Open file, modify file, save file, looks successful, see no errors, close file, look on disk, is my file there? Not always...

    From the article, it seems the "fault recovery" of both office suites might not be as friendly/helpful as it perhaps could be - having said that, fault recovery is HARD to write, and so is designing the "best" way to handle errors.

  35. Terry 6 Silver badge
    WTF?

    Slightly off-topic

    I had a meeting with a school's special needs co-ordinator a few years back and she had her PC on in her office. In the course of discussions I discovered that she didn't know how to save her documents, so she left them on the screen until they were ready to print - and if she got called away ( which she often did) for the sake of confidentiality she turned the PC off at the switch- losing all her work.

    I offered to show her what to do. She didn't take me up on it. Or ask any of the many teachers in the school, or the admin staff - all perfectly able to help her. But then she didn't often take notice of advice about the kids either.

  36. cyberelf

    Recovering corrupt DOCX file ..

    You could have tried opening the file in the Microsoft Word Viewer. A built-in recovery mechanism in msOffice for such corrupt files would also come in handy.

    "The class of XML error described above is absolutely insidious. If you are the type of writer who obsessively saves documents you are only digging your own grave."

    I normally save as version001, version002 etc., so I always have a working version to go back to. Something the current crop of 'computers' can't manage. Something that's been around since at least VAX/VMS.

  37. Tom 7

    f'in fck editor

    perfect round the office, perfect on the internet. A couple of lines of code and it's in your 'document control system' complete with 'version' control and the emergency 'rip the hard disk out and jump on it' when the goal posts move.

  38. Unlimited
    Windows

    XML Notepad and Open XML Productivity Tool

    That is all.
