Gov.UK to make its lovely HTML exportable as parlous PDFs

The UK’s Government Digital Service (GDS) has revealed it’s working on a tool that will export its web pages as PDFs. News of the effort comes at the end of a post that spends most of its time slagging off PDFs. “Compared with HTML content, information published in a PDF is harder to find, use and maintain,” wrote GDS …

  1. AlgernonFlowers4

    Print to PDF

    On the web page, press Ctrl-P to print, select PDF as the target printer and save as a file (.pdf)

  2. cbars

    Re: Print to PDF

    I think the key word is accessible. I'm not an expert but I imagine you need to structure the metadata in a particular format to make screen readers understand it; perhaps there is a dev around with a more technically complete answer....

  3. deive

    Re: Print to PDF

    Did this almost 15 years ago now, it is dumb, for the reasons stated, however I used https://xmlgraphics.apache.org/fop/ and it wasn't too hard. If your source is pretty well standardised....

  4. A Non e-mouse Silver badge

    Re: Print to PDF

    The problem with Ctrl-P is that many smart-arsed web designers implement different style sheets for print and screen because they think they know better. (Our in-house templates, for example, include the URL to links when printing web pages out.)

    The most reliable way I've found to print out a web page is, unfortunately, to screen shot it.

  5. stephanh Silver badge

    Re: Print to PDF

    You can even automate this with Chrome, since it can be invoked from the command line.

    chrome --headless --disable-gpu --print-to-pdf https://www.theregister.co.uk/

    See also: https://developers.google.com/web/updates/2017/04/headless-chrome#create_a_pdf_dom
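
For anyone wanting to script the command above, here is a minimal Python sketch. The helper names (build_command, print_to_pdf) are made up for illustration, and the binary name "chrome" is an assumption: it may be google-chrome or chromium on your system. It also assumes that a bare --print-to-pdf writes output.pdf into the working directory, as the linked page describes.

```python
import shutil
import subprocess
from pathlib import Path

# Only these three flags come from the comment above; everything else is illustrative.
CHROME_FLAGS = ["--headless", "--disable-gpu", "--print-to-pdf"]


def build_command(url: str, binary: str = "chrome") -> list[str]:
    """Return the argv list for a headless print-to-PDF run."""
    return [binary, *CHROME_FLAGS, url]


def print_to_pdf(url: str, workdir: Path, binary: str = "chrome") -> Path:
    """Run Chrome inside workdir and return the path where the PDF is expected.

    Assumption: with a bare --print-to-pdf, Chrome writes output.pdf
    into the current working directory.
    """
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary!r} not found on PATH")
    subprocess.run(build_command(url, binary), cwd=workdir, check=True)
    return workdir / "output.pdf"
```

Calling print_to_pdf("https://www.theregister.co.uk/", Path(".")) would then run the same command as above. Building the argument list separately keeps the URL out of any shell quoting.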

  6. Anonymous Coward Silver badge
    WTF?

    Re: Print to PDF

    The issue isn't the generation of the PDF file. It's that they need to format the HTML in particular ways so that the generated PDF is accessible, functional, etc.

    Also, they need to convince the coloured pencil department to stop producing PDF files themselves as they're universally shit for accessibility. Produce HTML that can be converted and everyone will be happy.

  7. Anonymous Coward

    Re: Print to PDF

    It's not as simple as a virtual pdf driver (I use it every day to save beeb articles for kids). Trouble is, they print EVERYTHING, when all you want/need is the specific text. Not the links on the left, not a photo SPLIT into two pages (as is the norm with pdf printing). Not that they want people to print pix! :)

  8. stephanh Silver badge

    Re: Print to PDF

    "It's that they need to format the HTML in particular ways so that the generated PDF is accessible, functional, etc."

    Actually Chrome's print-to-PDF is pretty good at this, frankly. The resulting PDF document is fully searchable, text can be selected, etc. I presume this means a screen reader would be effective (since clearly the original text is preserved as text). Hyperlinks in the original HTML become hyperlinks in the PDF.

    If there is something missing, it would make sense to contribute it to the open-source Chromium codebase rather than invent a wheel with more corners.

    Of course, if your original HTML was sh*t from an accessibility POV to begin with, print-to-PDF is unlikely to improve upon the situation.

  9. Tom 7 Silver badge

    Re: Print to PDF

    Lose printed PDF. Oh! Look it's still on the fucking internet I can find it there. Print and repeat.

    Pointless waste of trees/computer space.

    Pointless Document Format!

  10. 's water music Silver badge

    Re: Print to PDF

    "It's not as simple as a virtual pdf driver (I use it every day to save beeb articles for kids). Trouble is, they print EVERYTHING, when all you want/need is the specific text. Not the links on the left, not a photo SPLIT into two pages (as is the norm with pdf printing). Not that they want people to print pix! :)"

    I find the tools at printwhatyoulike.com quite helpful for this. You can easily mark regions/objects to exclude. I have used the Chrome addin and the JS bookmarklet

  11. Doctor Syntax Silver badge

    Re: Print to PDF

    "Oh! Look it's still on the fucking internet I can find it there."

    Except when it isn't, not even on archive.org and assuming I only need to be able to see it when I have an internet connection.

  12. Roland6 Silver badge

    Re: Print to PDF

    >I find the tools at printwhatyoulike.com quite helpful for this.

    I've used the extension/add-in Print Edit WE (Chrome/Firefox) for this job.

    However, in using such tools, you do get an appreciation of just how varied HTML is and the ease, or not in some cases, with which a webpage can be reduced to its substantive content. I'm sure this variable quality of HTML plays havoc with accessibility tools.

  13. Anonymous Coward

    Re: Print to PDF

    And, as someone who's been clearing up the estate of a recently deceased relative, I can tell you PDFs have their place. That's one job that would be a LOT easier if the mountain of paperwork had been scanned and OCR'd to PDFs rather than being randomly shoved in boxes over 30 years.

    Re: the Internet connection, we're not there yet. When access is 100% ubiquitous, cloud services manage to run years with zero downtime, and companies don't bail out of providing their services with almost no notice.... then offline PDFs and paper will have had their day. Today, the reality is that you just cannot rely on accessing online information when you need to.

  14. A Non e-mouse Silver badge

    Been there, done that

    One of XSLT's selling points was that you could take some XML structured data and, with the relevant XSLT files, transform it into HTML, PDF (via Formatting Objects), RTF, etc.

    The reality of it is a tad bit harder, though.

  15. Doctor Syntax Silver badge

    Re: Been there, done that

    "take some XML structured data"

    If only HTML were XML structured data.

  16. stephanh Silver badge

    Re: Been there, done that

    "If only HTML were XML structured data."

    That would have been nice. But since XHTML went nowhere, it isn't.

  17. Arthur the cat Silver badge
    Trollface

    Re: Been there, done that

    "If only HTML were XML structured data."

    If only XML data was Lisp S-expressions.

  18. Destroy All Monsters Silver badge

    Re: Been there, done that

    "If only HTML were XML structured data."

    Nothing that a deep-learning neural network can't fix.

  19. Aladdin Sane Silver badge

    "But we've always done it this way"

    When "This way" happens to be fucking shit and no longer fit for purpose.

  20. nuked

    'Long term roadmap' aka the list of things we will never do because by the time we find money for it the next genius would have taken over with a new strategic platform idea.

  21. Symon Silver badge
    Flame

    “This work is downstream of some higher priorities, but is on the long-term roadmap.”

    Aren't these people meant to use plain English? What's more, things that are downstream get to the sea first. It's a shite metaphor. Perhaps these wonks could try reading more Hemingway and less Dan Brown.

    As for using plain English, they currently have more urgent projects, but that work is included in their plans.

  22. Anonymous Coward

    Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”

    I think the Old Man and the Sea is still in copyright. If so, there are better ways of getting it than that link.

  23. Symon Silver badge
    Paris Hilton

    Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”

    As in come to Canada?

    https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-h.html

    p.s. https://www.whois.com/whois/arvindguptatoys.com

  24. Doctor Syntax Silver badge

    Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”

    "Aren't these people meant to use plain English?"

    Whatever gave you that idea? This is GDS. Whatever they're doing it has to be buried under the most opaque mounds of gibberish to stop anyone finding out.

  25. Anonymous Coward

    Re: “This work is downstream of some higher priorities, but is on the long-term roadmap.”

    Probably something from "Heart of Darkness". It would make sense.

  26. steelpillow Silver badge

    History repeats itself

    Are you sure this is .gov.uk and not Wikipedia? The same leisurely approach to customised HTML-to-PDF conversion is under way in both, and Wikipedia have made a .gov.uk-style ballsup of their first two stabs at it (wrt stephanh's comment, round two was a fruitless attempt to make headless Chrome fit for purpose) and in desperation have outsourced Round Three to their book publisher. It's so the same story in different clothing.

    HTML5 sucks in more ways than most folks, including .gov.uk, realise >cough< offline >cough< javascript >cough< information layout >cough< and pdf, done properly, has a lot going for it in its own niche. But I have to ask, if dual-media publishing from a single source is the aim, then why fuss about accessibility of the pdf when you have to flippin' access the html edition in the first place in order to get to it?

  27. Anonymous Coward

    Re: History repeats itself

    "why fuss about accessibility of the pdf when you have to flippin' access the html edition in the first place in order to get to it?"

    Probably related to the Public Sector Bodies (Websites and Mobile Applications) Accessibility Regulations 2018.

  28. Andrew Yeomans

    Multi-page documents

    The other advantage of a *good* HTML to PDF system is the ability to select multiple web pages, and combine them into a single PDF document, with sections in the correct order.

    For example, try to print the NCSC Cloud Security Principles starting from https://www.ncsc.gov.uk/index/topic/151. Similarly try printing appropriate employment and tax pages. The next trick is to make it print double-sided.

    I have - once - come across a system which would let you select the desired sections of a larger set of documents, then it would generate a single PDF of them all, in a suitable format for printing.

  29. Roland6 Silver badge

    Re: Multi-page documents

    "I have - once - come across a system which would let you select the desired sections of a larger set of documents, then it would generate a single PDF of them all, in a suitable format for printing."

    Expert PDF from Avanquest used to have this feature. There were times when it was really useful, for example taking a printout of a shopping basket (full list of part numbers and descriptions of content and prices) and then a printout of the final checked out order (basic item details and pricing).

  30. Peter Prof Fox

    Stand alone, reliable documents

    A PDF is self-contained.

    An HTML document has css links and scripts (and trackers)

    A PDF can be reliably printed and passed around. (A lot of people are not digitally agile)

    An HTML document requires a computer/device, browser and the knowledge to use it. A hard copy to get signatures on is pot-luck.

    A PDF can be reliably stored as reference. I have it. I can archive it and index it.

    An HTML user manual (say) is moved, deleted, or updated to reflect model 2 features but not my model 1

    What they should be doing is banning Word documents.

  31. Tom 7 Silver badge

    Re: Stand alone, reliable documents

    A PDF is out of date the minute it is made. The online HTML should be up to date, should be legible on phone, tablet or PC and does not self shuffle in the print tray or drink coffee.

  32. Pascal Monett Silver badge

    Re: Stand alone, reliable documents

    A pdf also requires a computer/device, the knowledge to install a PDF reader and the ability to use it.

    If you're talking about printing then I don't care if you printed from a web page, a Word document or a PDF - it's printed and that's the end of the problem.

  33. ibmalone Silver badge

    Re: Stand alone, reliable documents

    "A pdf also requires a computer/device, the knowledge to install a PDF reader and the ability to use it."

    Because I so often read HTML over the air and straight into my brain without using any device or software.

    I don't know any current consumer OS that doesn't have a PDF reader. Windows - Edge does it. Linux - KDE has Okular, Gnome has Evince. Mac OSX - Preview. Android has Google's pdf reader. Both Chrome and Firefox will have a stab at it on desktop OSes.

    In practice, PDF is handy for archiving documents. HTML doesn't work as well because in most cases it requires storing resources alongside it (though yes, you can base64 encode images and stuff them in), and how browsers interpret it changes over time, while display of PDF is more stable and there is the PDF/A standard. Whether the resulting document is accessible / searchable largely depends on the source document, if it was structured text (LaTeX, markdown, office documents, XML, and yes, even HTML) with a sensible interpreter then the resulting PDF can be accessible. If it was scanned pages of an article from 1950 then no, but the HTML version isn't going to be either.
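
The "base64 encode images and stuff them in" trick mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the naive string replacement only handles an exact src="..." match, so a real page would want a proper HTML parser.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data: URI, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # fallback when the type is unknown
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image(html: str, src: str, data: bytes) -> str:
    """Replace one exact src="..." reference with an embedded data: URI."""
    return html.replace(f'src="{src}"', f'src="{to_data_uri(data, src)}"')
```

With every image inlined this way the HTML file no longer depends on resources stored alongside it, at the cost of roughly a third more bytes per image. It does nothing about browsers interpreting the markup differently over time.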

  34. Doctor Syntax Silver badge

    Re: Stand alone, reliable documents

    "A PDF is out of date the minute it is made."

    A fact which is extremely problematic for those in govt. who might have a shifting relationship with what they said a minute ago and very handy for those who want to hold them to account.

    TL;DR? Permanence has value.

  35. phuzz Silver badge
    Facepalm

    Re: Stand alone, reliable documents

    "I don't know any current consumer OS that doesn't have a PDF reader. Windows - Edge does it. Linux - KDE has Okular, Gnome has Evince. Mac OSX - Preview. Android has Google's pdf reader. Both Chrome and Firefox will have a stab at it on desktop OSes."

    Half of those you list are actual web browsers. You know, software designed originally to parse HTML?

    At this point we can safely say that HTML and PDF are (roughly) about as easily accessible on any electronic device as each other. Not least because most of the software for reading HTML will also display a PDF and vice-versa.

    Mind you, basic HTML is at least somewhat human readable in a text viewer, which isn't something you can say about PDF.

  36. ibmalone Silver badge

    Re: Stand alone, reliable documents

    Hardly the point really. I was pointing out that the idea it's hard to read PDF belongs back in 1995. And yes some of them are web browsers (making "about half of them" if you include firefox and chrome which I tacked on as additional examples of software you almost certainly already have).

    You'll find those web browsers also display images, video, audio, plain text and will have a stab at displaying XML. Is HTML a substitute for all of those? Will the available version of those browsers display the same HTML document the same way next year? If you're displaying plain text why not just use plain text? Or markdown? It turns out different tools have different uses.

  37. xyz

    The way government works is this....

    Oxbridge types (of the blue sky persuasion) do not use computers; they want a hard copy (emails, info etc) from their "girls" and still give dictation.

    SPADs and other assorted climber-upers only believe in something if it's in Excel.

    Managers tell their "girls" to type stuff into Word, save as pdf and slap it on their intranet page.

    Only "girls" (and other data entry types) use "working class" html.

    You can bet that behind this "necessity" is some crusty who wants his "girl" to send him an email with a pdf attachment so he can print it off.

    To give you an idea of the arseness available... one top dog was on hols in France and was viewing a 320 page document on the UN web site, he wanted a copy so he phoned his "girl" in London and told her to print it off and fax it to him. I am not joking.

  38. Doctor Syntax Silver badge

    Re: The way government works is this....

    "SPADs and other assorted climber-upers only believe in something if it's in Excel (struck out) Powerpoint."

    FTFY

  39. tiggity Silver badge

    XML

    Store underlying data as XML - nice and simple content with some basic description

    Run appropriate transform(s) to give HTML (the descriptive elements in the XML give appropriate HTML)

    Run different transform(s) to give PDF.

    Things like XSL-FO are your friend

    It works nicely (did some proof of concept stuff on this ages ago, back when mobile devices had weedy screens: the same content gave desktop HTML, mobile HTML and PDF by running the appropriate XSL)
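
A rough sketch of the single-source idea, for illustration only. Python's standard library has no XSLT processor, so this swaps in plain ElementTree code where the real thing would use XSL stylesheets. The source vocabulary (doc, title, para) and the function names are invented for the example. The second output is a minimal XSL-FO document, which would still need a formatter such as Apache FOP to become an actual PDF.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical source document: the element names are made up for this sketch.
SOURCE = (
    "<doc><title>Guidance</title>"
    "<para>First point.</para><para>Second point.</para></doc>"
)


def to_html(doc: ET.Element) -> str:
    """Render the source as a bare HTML body."""
    body = ET.Element("body")
    ET.SubElement(body, "h1").text = doc.findtext("title")
    for para in doc.findall("para"):
        ET.SubElement(body, "p").text = para.text
    return ET.tostring(body, encoding="unicode")


def to_fo(doc: ET.Element) -> str:
    """Render the same source as a minimal XSL-FO document."""
    def fo(tag: str) -> str:
        return f"{{{FO_NS}}}{tag}"

    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {"master-name": "page"})
    ET.SubElement(page, fo("region-body"))
    seq = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(seq, fo("flow"), {"flow-name": "xsl-region-body"})
    ET.SubElement(flow, fo("block"), {"font-weight": "bold"}).text = doc.findtext("title")
    for para in doc.findall("para"):
        ET.SubElement(flow, fo("block")).text = para.text
    return ET.tostring(root, encoding="unicode")


doc = ET.fromstring(SOURCE)
html = to_html(doc)
fo_xml = to_fo(doc)
```

The point is only that both outputs are derived from one description of the content, so neither the HTML nor the print version is maintained by hand.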

  40. ZanzibarRastapopulous Silver badge

    PDF is the work of the devil...

    So a natural fit for the civil service.

  41. Arthur the cat Silver badge

    Re: PDF is the work of the devil...

    "So a natural fit for the civil service."

    To get theological, the Civil Service are so "on the one hand, on the other hand …" the Devil would reject them and they'd end up in the Vestibule of Hell, chasing deviceless banners and being stung by hornets.

    I want a Dante icon.

  42. Cem Ayin
    Boffin

    If your only tool is a hammer...

    Both formats have their strengths and weaknesses; wise guys choose whatever suits the job at hand best.

    Yes, PDF /is/ print-oriented - and that's a major advantage for publishing long texts that require attentive reading. A document set in a reader-friendly font with proper paragraph filling and hyphenation is so much easier on the eyes; it lets your mind focus on the content rather than the technicalities of a poor text rendering (which is the norm in HTML). I speak from experience, I do read a lot.

    And I'm not alone. I work in an academic setting and at our lab, the computing devices most in demand ("high demand" being defined as "users scream /immediately/ when it fails") are 1. the personal laptop and 2. the workgroup printer - and that's for a reason. /Nobody/ would want to read a scientific paper as HTML on the screen, with the poor rendering constantly distracting the mind from the problem at hand. (Some folks do use rotating monitors for reading papers, but it is PDF they read on the screen in portrait format.)

    And I haven't even mentioned the problem of embedded figures yet: good luck with copying the full content of a HTML page (skipping unneeded navigation code) for offline reading...

    That is to say, there are use cases where HTML is simply no go.

    The optimal use case for HTML (plus JS where that really makes sense) on the other hand is short, frequently changing or short-lived documents that no one would want to read offline or in print; or documents of a highly interactive nature; or reading the same document on a wide range of display sizes (making allowances for the text layout and rendering) - that's what it was designed for after all.

    Bottom line: Use a hammer for nails and a screwdriver for screws. Heated ideological debates as to whether screws are outdated and should universally be replaced with nails are frankly daft.

    (And yes, both formats have rather more than their fair share of warts. A text format that is versatile enough to cover both use cases would be really nice to have. Good luck with developing something of the kind *and have it widely accepted by your audience*...)

  43. Anonymous Coward

    Re: If your only tool is a hammer...

    15th Standard: https://xkcd.com/927/ :)

  44. Nick Kew Silver badge

    Not reinventing the wheel

    As many commentards have noted, this is a frequently-solved problem. A decent minority of historic HTML/PDF solutions take the accessibility issues seriously.

    I expect what the gov.uk chap means is that they'll take some such thing - probably XML-based - and integrate it into their own publishing.

    That is, unless and until such a sensible goal gets lost under a weight of empire-builders and PHBs.

  45. Doctor Syntax Silver badge

    Re: Not reinventing the wheel

    "That is, unless and until such a sensible goal gets lost under a weight of empire-builders and PHBs."

    This is GDS. Of course that will happen.

  46. SVV Silver badge

    What's the betting

    that all the senior civil servants at the Government Digital Service require every document to be printed out for them on paper? Rather than reading them digitally?

  47. fpx
    Facepalm

    Offline Reading

    I frequently print web pages to PDF for storage and offline reading. In my experience it's not the PDF that goes out of date but the online content disappears or gets modified. The "1984" experience where information is centrally controlled and modified as necessary is easier to pull off every day.

    The article says that "most [PDFs] come into existence because designers want total control." Unfortunately that's very much the same for Web content, where every element is arranged down to the pixel, images are deferred-loaded so that they can track when and how far down you scroll, and random ads appear all over the place as you move around.

  48. Registered Register Registrant

    HTML sucks to read

    HTML is practical, quick, and dirty, but as Cem Ayin suggests, it sucks to read. A DVI file typeset with TeX in 1982 still reads better on a desktop monitor than any Wikipedia or gov.uk page today, 36 years later, and any serious reader would prefer a well-typeset PDF on the screen. Serious reading is not some artifact of "ingrained print culture".

  49. David 164

    Let's be honest, most of it is because the people writing those PDFs are forced to make them public by law or convention. If it was their choice they would be printed off and locked in some filing cabinet where members of the public or the media or even MPs would have to fight through a pile of bureaucracy to get to them.

Biting the hand that feeds IT © 1998–2018