The government is to change the law so that all data released under the Freedom of Information Act will be fully accessible to computers. Cabinet Office minister Francis Maude told the Conservative party conference in Birmingham that the Freedom of Information Act will be amended so that all data released through FoI must be in …
Hang on a minute
If there's an IT spending freeze, who is going to do the IT work needed to implement this?
And it won't always be just a change in the file format you save output in, either: it all has to be records-managed through an EDRMS and a content management system, plus a myriad of reporting tools for the numbers.
Government documents are often released as locked PDFs to stop people doctoring the output and making false claims about what they were sent. You can imagine what the Daily Frights would do with government spreadsheets they could massage.
>You can imagine what the Daily Frights would do with government spreadsheets they could massage.
The media massage the data anyway, to suit their own stupid viewpoints and tell lies. They often splash a story that is 1% truth and 99% extrapolated nonsense.
At least this way we can check for ourselves and it'll be obvious just how much is "news" and how much is "opinion".
Anybody else heard of XML...
If the government truly wants this data to be machine readable, why not have it published in one of the many commonly used XML standards, so it can be useful to the rest of us in web services? That's another thing the public sector would be well advised to go and look into.
Standards are wonderful
There are so many to choose from. To wit: "XML" is not a single standard. Much of the stuff that ends up encoded in "XML" is only barely "machine readable".
On another note: PDF is not the "visual only" format Kable here purports it to be. I'd rather have pdf than Excel because I have a much better chance of extracting useful data from it. It depends on what you put into it. If it's pictures of scanned pages, then extracting the text is not going to be very straightforward. Otherwise, you can just copy-paste it, run it through a filter to get only the text, or what-have-you. It's a format geared for printing, and the specification is open. That last bit is emphatically not true of Excel sheets.
For XML, it all depends on how well the encoding was done on the XML layer, on the DTD layer, on the content layer, and so on. "XML" by itself is almost entirely meaningless. Much like CSV, only lots more complicatedly so. CSV doesn't establish a minimum level of complexity in libraries and/or external dependencies just to open the file and start to load it. For that reason alone I prefer simple plain text or CSV files over "XML", even before I ended up having to deal with some severely broken IT projects that thought the three letters X, M, and L were going to save the day.
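To make the point about "just open the file and start to load it" concrete, here is a minimal sketch in Python, using only the standard library. The sample records, column names and element names are all invented for illustration; they are assumptions, not anything a department actually publishes. The CSV loader needs nothing beyond the header row, while the XML loader only works because the reader already knows the element names and nesting.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented purely for illustration.
CSV_TEXT = (
    "department,year,spend_gbp\n"
    "Example Office,2010,1200\n"
    "Sample Agency,2010,850\n"
)

XML_TEXT = """<records>
  <record><department>Example Office</department><year>2010</year><spend_gbp>1200</spend_gbp></record>
  <record><department>Sample Agency</department><year>2010</year><spend_gbp>850</spend_gbp></record>
</records>"""


def load_csv(text):
    """The header row names the columns; every later row becomes a dict."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def load_xml(text):
    """Assumes a <records>/<record> layout that the reader must know in advance."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


csv_rows = load_csv(CSV_TEXT)
xml_rows = load_xml(XML_TEXT)
```

Both loaders end up with the same list of records here, but only because the XML layout was agreed beforehand. Change the nesting or the element names and load_xml quietly returns nothing useful, which is the "architecture for your data" problem in miniature.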
You still need an architecture for your data and a formal specification for your protocols. If you fsck up your XML specification to the point where the colour of the text in your (.doc) specification becomes important for the semantics of your XML-packed data format, and then you forget to clarify both that and what the colours mean, you'd be wisest to shoot down the entire project and start over. And this time, don't outsource these design decisions to the lowest bidder from India. Hire someone competent.
Of course they didn't. It was XML, right? That's a magic bullet, right? Well, that was far from the only thing they did wrong. That half-year project is still dead in the water, years hence.
If the government cannot understand the limitations of a well-defined format like PDF, then it cannot be trusted to deal properly with something as nebulous and in need of additional fleshing out as "XML".
This is no surprise: most real-world XML applications are disasters of unnecessary complexity. It's just that the complexity sits where it usually doesn't hurt developers much, so they love its convoluted, obese and redundant verbosity. As such, it is very much a buzzword for developers as well as project managers. If you like those properties in government, then yes, it is a great fit.
Yup, it's just like you said.
If the documents were scanned into PDF format from paper documents then they cannot be used effectively. One could use OCR to extract some of the information depending on how the original paper document was laid out, and then all you need to do is go through it and correct all the mistakes the OCR made.
If the documents were converted into PDF format from a word processor program, then the text can be copied and pasted back into a word processor program.
I wonder if all the councils have all or most of their documents in electronic format?
FOI Action is only possible early in a government's life
Blair stated recently that he regretted some aspects of FOI legislation. He is not alone in feeling this way as the longer a government is in power the more angst it develops against letting the 'shareholders' know what is going on.
So it is fortunate that this new amalgamated UK government has done something early in its term to improve Blair's idea of FOI.
In any event, the present UK government structure is unlikely to produce as many missteps as would a government formed from a single party.
Checks and Balances
As the availability of FOI information rises, the number of FOI requests granted will fall.
What's wrong with pdf?
Some pdf files are perfectly machine readable - it all depends what was used to make them. If a scanned document is turned into an image and then printed as a pdf, it can't be machine read. If it is produced as a text or doc or other text based file, then turned into a pdf, basic text search is available.
As to the general principle - yes - I agree. I'm fed up seeing long FOI documents released as image files, that I can't cut and paste to make quotes from.
But the main problem is that there are so many ways for government to dodge the FOI Act. For example, the Home Office and Cabinet Office nowadays just bat embarrassing but perfectly reasonable requests away with a "vexatious" excuse at the first time of asking. It never occurs to them that actually answering the occasional simple FOI request might prevent the questions being asked so often. As far as I know, being a "campaigner" is not a reason for losing your rights under the FOI Act. If the Home Office or Cabinet Office had been responsible for MPs' expenses, for example, we would NEVER have found out about the scams.
What - are we all just bookkeepers now?
csv is fine for simple datasets - what about more complex data - is there going to be an rdb standard? Or are the pols going to show their complete ignorance and suggest xml? (I won't bother doing the downside of that, as the AC in "Standards are wonderful" has already done it.)
And what about the tons of public sector information which is not datasets but reports, memos, letters etc. in various formats - what's the magic bullet for that lot - .txt - or that well known universal format .doc? What about the many departments and areas that still keep records on paper (more than you might think), and records that are paper or images because they were originally typed (on a typewriter) or handwritten (loads of forms in government, you know) - oh I know - OCR.
The point of FOI is that it makes people entitled to know stuff - not to make life easy for lazy journos. How about we save the taxpayer the money and tell these leeches to do their own research and analysis - you never know - actually reading and understanding something instead of mashing it up - they might one day get to the point where they could be accused of knowing what they are talking about.
Oh, nice point.
Yes, all that's needed is either photocopies or scans... and why not scan anyway, and print the scans in case the recipient wants paper rather than having them tossed over in an email. Then you can cache the scans and serve those documents from the cache the next time. Provided the government has some document tracking system, that is. They have that, right? Right?
You can even centralise the storage, encrypted (by the storing department) if you must, for only the storer really needs to retrieve the cached documents, since they get the FOI requests. Unless you'd like to centralise the handling of FOI documents as well, only passing on requests to scan documents that haven't been scanned yet.
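The scan-once, serve-from-cache idea could look roughly like the sketch below. Everything in it is hypothetical: the ScanCache class, the document identifiers and the fake_scan stand-in are made up for illustration, and a real system would keep the scans on disk behind whatever document tracking system exists, not in a dictionary in memory.

```python
class ScanCache:
    """Keeps one scan per document id, so repeat requests never hit the scanner."""

    def __init__(self):
        self._store = {}          # document id -> scanned bytes
        self.scans_performed = 0  # how often we actually had to scan

    def get(self, document_id, scan_fn):
        # Scan only on a cache miss; otherwise serve the stored copy.
        if document_id not in self._store:
            self._store[document_id] = scan_fn(document_id)
            self.scans_performed += 1
        return self._store[document_id]


def fake_scan(document_id):
    """Stand-in for a real scanner: returns placeholder bytes for the document."""
    return ("scanned image of " + document_id).encode("utf-8")


cache = ScanCache()
first = cache.get("memo-001", fake_scan)   # cache miss: the document is scanned
second = cache.get("memo-001", fake_scan)  # cache hit: served from the store
other = cache.get("memo-002", fake_scan)   # a different document: scanned once
```

Centralised handling would then amount to one such cache in front of all departments, with scan_fn being the request passed on to whichever department holds the paper.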
The USPTO uses TIFF in a very BMP way, which stands to reason, but may be a bit overkill here. For black-and-white documents some of the same trickery used in photocopiers would be permissible and then add lossless (or very slightly lossy, but certainly don't want to introduce more artefacts) compression to make transport easier.
I disagree about .doc: it and its pseudo-standardised pseudo-successor .docx were never intended to be, and therefore are not, interoperable or even fit for archive use. Maybe on an as-available, on-specific-request-only basis, but because they are proprietary they are unfit for FOI service, certainly as a primary format. PostScript, or *shudder* rtf *shudder*, would be a better choice on interop grounds alone. Which basically ends you up with pdf again.
If you want to do an honest OCR you'd have to provide the original scans too anyway, just in case the OCR fscks up -- and even Google's does, after lots and lots of tweaking. I don't think a good combined document format exists (yet). For storing the scans, BTW, DJVU is a good choice, on purely technical grounds. It was pretty much designed to efficiently store scanned mostly-text stuff.
It's still a good idea to gather up all sorts of numbers and heaps of text for sorting and searching and selecting and whatnot else. But it doesn't really fit well within the FOI scope. I don't think it would be a good idea to cram them together, either. One is about ensuring simple access, the other is about analysis.