Germany - an incredible concept?
From my perspective in the UK Germany looks to have a most enviable mindset.
See, for example...
A "highly unusual" additional parliamentary hearing on proposed changes to German copyright law is a sign that there is increasing opposition to the publisher-driven plans, an expert has said. A cross-party sub committee on new media is scheduled to stage a hearing of stakeholders' views on the proposed amendments to the …
I for one welcome any democratic government's attempts to curtail the thieving, tax avoiding, and criminal activitoes of Google. The web is NOT Google's private property, adding information to the web does not give Google an automatic right to steal it.
The days of freeloading by aggregators who provide very little content of their own should come to an end.
Anyone with basic webmaster experience knows that putting un-password-protected content on your website implicitly means you are granting access for world+dog to see it.
If you are OK with that but specifically want Google to keep its nose out of your content then use a robots.txt and Google will stay respectfully away from you.
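For anyone who wants to see what such a rule looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content, the example.com URL and the `allowed` helper are all made up for illustration. It only shows how a well-behaved crawler would read the file; it proves nothing about what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Googlebot out entirely, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/notes/today.html"  # assumed URL
    print("Googlebot allowed:", allowed("Googlebot", page))
    print("SomeOtherBot allowed:", allowed("SomeOtherBot", page))
```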
Google says that they "respect" robots.txt - but can you give a concrete example of a website that has blocked Google that is invisible to Google? The reality is that google ignores robots.txt.
"but can you give a concrete example of a website that has blocked Google that is invisible to Google?"
My old blog was blocked to Google using robots.txt and the log data showed I only ever got the occasional hit by the google bot hitting the robots.txt file.
Of course, my blog just had a few personal notes which weren't really private but weren't really public either. Maybe Google wouldn't have respected robots.txt if I had anything interesting there and getting lots of hits.
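The kind of log check described above can be sketched roughly as follows. This assumes the common "combined" access log layout, where the request line is the first quoted field; the function name and any sample data are hypothetical. Note that it trusts the user-agent string, which anyone can fake.

```python
from collections import Counter


def googlebot_paths(log_lines):
    """Count requested paths on log lines that mention Googlebot.

    Assumes the combined log layout: ... "GET /path HTTP/1.1" ... "user agent"
    """
    counts = Counter()
    for line in log_lines:
        if "googlebot" not in line.lower():
            continue  # not a line that claims to come from Googlebot
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field, skip malformed line
        request = parts[1].split()  # e.g. ["GET", "/robots.txt", "HTTP/1.1"]
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts
```

If the only path that ever shows up in the result is /robots.txt, the crawler is at least fetching the file and not the pages behind it.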
You're the one making the improbable claim so perhaps you should provide the evidence. Give us a concrete example of a website that has blocked Google and which Google is nevertheless indexing!
The reality is that almost nobody wants to be "invisible to Google". However, I can think of a couple of examples where robots.txt is useful:
- You're putting old issues of a newsletter online but don't want to make it too easy for people to look up the names and addresses which are occasionally mentioned in the text.
- You've got a huge number of pages automatically generated to give multiple views of the same data. It's easier for web searchers if you just make one standard, canonical view visible to Google so that people don't find hundreds of almost identical pages in the results list.
What you suggest is trivial to test/monitor and not the case. Shouldn't you be in school?
I've never double-checked to see if Google actually scans but doesn't publicly list contents "protected" by robots.txt but from the sites I have seen Google does respect robots.txt. There are, however, plenty of search engines out there that don't.
Here is an example robots.txt - if you don't believe it got respected by Google then show us some disallowed content in Google's results.
My company has half a dozen live testing domains - all except the actual public domain are withheld from Google and other engines via ROBOTS.TXT - every engine from Alta-Vista onwards has honored the exclusions and the testing domains never show up in search results.
The reality is that you can't show a site where Google ignores a correctly formatted robots file.
I too naively hoped that Google and the other search engines would respect robots.txt. So, before posting a reply here to say "Of course robots.txt will do the job", I went to check Google for anything from my little web-site that is supposed to disallow all search engines.
Sorry, guys, but it's there on Google! Either my robots.txt* is broken or Google is displaying the original pages from the few weeks between the website going live and robots.txt being enabled, in 2005!
[And no, I'm not going to post the URL - it's supposed to be kept non-public, which does make demonstrating the issue somewhat problematic.]
* # go away
Regardless of robots.txt, Google spiders the pages. Whether it displays the pages is another thing. For that you have to be adept at regexp and have covered each and every way by which a page 'might' be seen by a spider.
For example, I have a site with a page at example.com/private/mypage. If I have a German translation of that page then it will be at example.com/de/private/mypage; if not, that URL redirects to the untranslated page. For robots.txt to work, despite having the canonical URL set, I have to make sure that every conceivable URL that might redirect to example.com/private/mypage is covered. For a Drupal-based site that also means covering the non-friendly URL example.com/node/1234; in WordPress, it means covering example.com/?p=1234, plus any variations of extra arguments, etc. On any reasonably complex site keeping Google out is impossible.
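The coverage problem can be checked mechanically. The sketch below is only an illustration: the robots.txt rule, the list of variants and the `uncovered` helper are assumptions, and Python's standard-library parser does plain prefix matching, which may differ from how a given search engine matches rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule that only covers the "obvious" location of the page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical alternative URLs that could all lead to the same page.
VARIANTS = [
    "https://example.com/private/mypage",
    "https://example.com/de/private/mypage",
    "https://example.com/node/1234",
    "https://example.com/?p=1234",
]


def uncovered(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the robots.txt rules do NOT block for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in uncovered(ROBOTS_TXT, VARIANTS):
        print("still crawlable:", url)
```

Every URL it reports needs its own Disallow line (or a broader rule) before the page is really excluded.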
Google turned me up http://www.robotstxt.org/ as a resource.
A checker was promised but, "We currently don't have our own /robots.txt checker, but there are some third-party tools: •Google's robots.txt analysis tool (requires a Google Account) " - er, that's all. Still. email@example.com if you'd like me to see what I get from it - and you say Google is into your stuff anyway so what does it matter pointing them at it again?
As far as I can see, only the "User-agent: " and "Disallow: " lines should be in your file, and, of course, it should be 8-bit ASCII text and correctly spelled.
What about the result in other search engines?
One other technique, I suppose, is to fill your web site with adverts for Canadian pharmacies - or use freely advertised illicit SEO methods, which Google recognises and marks down - or report yourself to Google as an abuser, or, report yourself to Internet Watch, which campaigns against child pornography.
https://www.google.com/webmasters/tools/spamreport will probably get you deleted from Google, as you wish!
I think John Lilburne is right that the robots.txt has to cover all possible URL combinations. In further testing to check the dates of the web-pages on my site indexed on Google, the cached option shows that Google last crawled my web-site on 13-Feb! So my robots.txt is either broken or Google doesn't respect it. I'll modify the robots file and then monitor Google for a few years........
Just have your webserver block anything that says it's from GoogleBot.
Easy enough with apache.
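The same idea, refusing requests by user-agent at the server, can be sketched in Python as a small WSGI wrapper. To be clear, this is not Apache configuration; it is a stand-in to show the principle, and the `block_googlebot` name is made up. It only stops clients that announce themselves as Googlebot.

```python
def block_googlebot(app):
    """Wrap a WSGI app so requests whose User-Agent mentions Googlebot get a 403."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if "googlebot" in agent.lower():
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```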
The stories are indexed by Google, the comments are not. This is done in robots.txt
>report yourself to Internet Watch, which campaigns against child pornography.
Hmm... novel... methinks the "cure" could prove worse than the "disease"
> Regardless of robots.txt Google spiders the pages
It doesn't on my sites.
Google in its turn will say: "OK, we will not list any content of you in the search results"? I think in the end the publisher will suffer even more. And google could start asking money for indexing their website and showing it in the results. And as a side note, are these the same publishers who make their online money through Google AdWords?
Can't seem to reply to or download the drivel.
I for one welcome any democratic government's attempts to curtail the thieving, tax avoiding, and criminal activitoes of Google.
Typos aside, tax avoidance is not illegal. It can be argued that the UK's synthetically low corporate tax is in fact an invitation to it; okay, the EU tentacle of Google is registered in even more accountant-friendly Ireland, but for the same reasons. Google does not engage in theft. It would be interesting to know what other "criminal activities" you think it's involved in; bear in mind that this would leave you open to charges of slander in "any democratic government".
Not that I'm whitewashing Google - the collection of WiFi data certainly does not fall into the category "do no evil" but it wasn't necessarily illegal either, hence no criminal charges.
No, the problem for the newspapers is that they have for years failed to come up with a viable model for the internet. Laws like this or the DMCA are only sticking plasters that will invite abuse whilst at the same time fail to protect the underlying business.
John Lilburne is a known copyright troll down there with the likes of Turtle and PirateSlayer et. al. and, like them, probably an RIAA / MPAA staffer. His brain cell likely isn't capable of assimilating the concept of publicity and revenue provided by big search engines linking to content, because all it's capable of is something like, "duh, it's all mine and I don't want anyone to see it without paying, duh..."
Just ignore him. If you can be bothered wasting the few seconds it takes, downvote him if it helps you feel a bit better after being subjected to his drivel. ;)
There is no revenue from big search engines linking to content on creators' websites. The price of an ad impression is about 1000th of a penny. Revenue accrues to sites that have 1000s of pages containing the most popular content, normally aggregating sites that simply scrape or pirate content and wrap ads around the content.
For newspapers the ad revenue is less than 1/10th of the revenue that they would have got from a print advert. Then what views there may be are syphoned off by scraping and aggregating sites.
For newspapers the ad revenue is less than 1/10th of the revenue that they would have got from a print advert.
And the online newspaper pays less than 1/1,000,000 the cost of printing the advert to do so.
- I almost certainly didn't put enough zeros on that number, as the publication costs of online adverts are almost entirely borne directly by the reader of the advert (free) and by the advertising agency (already covered by the reduced revenue).
Yes, you need more throughput to get the same gross profit after paying the staff, but again, that's easier - compare the cost of printing and distributing 1000 newspapers to serving a website to those same 1000 readers.
It's true that many of the old ways of making money have gone. Tell that to the manuscript illuminators, they're the only ones who'll care.
The adverts pay for the content, if the single bit of content is being spread about a dozen or more outlets only one of which is creating the content, and in addition there is a 10th or more drop in the revenue per paying advertiser on the content creator website, then there is no business for creating the content. The content creator might just as well scrape the content from other sites.
What you end up with is a contraction in voices. You see it everywhere, where a company press release is simply recast and then copied across 20 or 30 other sites. When was the last time you saw any original content on HuffingtonPost, content that wasn't simply cribbed from elsewhere or a regurgitated press release?
The hivemind in tech journalism has reached the point where any negative press or critical opinion is viewed as either an aberration, or heresy.
Putting 'stuff' on the web is akin to leaving it lying in the street (much as this post is like me being in a Town Square, standing on a box and shouting this to anyone within earshot).
Putting a book out in front of a store, on the street, (or even displaying the book inside on a shelf) affords the passer-by to read a little of it (or even - shock! horror! - make away with it..).
Attempting to use the medium of the web to sell items that can be copied was always going to end in tears and tantrums. Especially when the business model was going to be based on charging the same for digital content as the hard copy and taking all the profit that would have been made by the retailer, distributor etc and keeping the savings from NOT producing hard copy. .... (which appears as 'greed' to the consumers).
Not sure if there is a solution..Search engines effectively advertise your stuff.. no advertising=less sales...
"Especially when the business model was going to be based on charging the same for digital content as the hard copy and taking all the profit that would have been made by the retailer, distributor etc and keeping the savings from NOT producing hard copy. .... (which appears as 'greed' to the consumers)."
Yes, it does seem like greed to the consumers, and, in some cases it probably is - if a book comes out as hard-copy and e-book at or about the same time, then there is no excuse for the e-book costing the same as the hard-copy. All editing and proof-reading will have been done (probably inefficiently based on new books I've read over the last few years) on the same electronic copy (I don't think there are any publishing houses that allow paper submissions any longer), and so they should have exactly the same set of mistakes, but should be cheaper by the amount of printing and distribution costs (okay, I know in the UK there is a strange VAT problem afflicting e-books).
However, being an editor of e-books scanned from old books, I know that the extra work is significant, and I could see that someone might be able to justify charging for that. The problems come from poor scanning, misprinted characters, missing lines (which might have been in the original when dealing with old sci-fi, as I do), inserted characters where the scanning program has included a mark on the paper. This means that every word has to be scrutinised. However, what I am long-windedly getting to is that these are the books that people will usually NOT charge for - it is a hobby, if a quite sad one :-). People like me give their time freely for others to have a great reading experience.
"if a book comes out as hard-copy and e-book at or about the same time, then there is no excuse for the e-book costing the same as the hard-copy."
Today the cost of printing and paper distribution may account for a couple of quid if that. Hardback books I bought in the 1970s cost about £10-15, now some 35 years on and the price hasn't changed much at all.
I'll suggest three possible reasons.
1) German legislators have doubts whether the proposed act is unduly protective of the publishers or has some knock on effects that weren't properly considered
2) public opinion through the modern electronic medium has fed up through the democratic process that this would restrict ordinary citizens' access to information
3) "G for Germany", Herr Minister (I checked this with a popular web-based search engine - so you should get the meaning I intended. PS this is a bit of a joke, please don't sue me, American search giant)