Two corrections to this article
1. Google do not run Hadoop internally. They have Google FS, BigTable, Pregel and other things. The Apache Hadoop stack is evolving to be equivalent, but Google have their own stack, which predates much of Hadoop. The paper gets this right; it's the El Reg journalists who appear confused.
2. Bandwidth after the "MapReduce" stage is normally much less than ingress bandwidth. Hint, the word "reduce". This usually means squeezing down log data and the like to smaller summary.
Regarding the ingress/egress bandwidth, if all you are collecting is internal log data, you can predict the data rate (your daily click count, compressed), and its origin (your servers). Click log bandwidth will always be much less than site bandwidth, unless your site is something like bit.ly that just bounces 302 redirects back to the caller, in which case it's probably equal. Provided you keep the web servers near the Hadoop cluster, the cluster ingress bandwidth will be straightforward to handle.
The paper looks specifically at the problem of "classic" enterprises (i.e. pre-web), where systems are widely distributed for historical reasons; intra-enterprise traffic becomes a problem. This is probably the case when the application is itself distributed (telcos, banks). if your servers are scattered across 20 datacentres for historical reasons, you should consolidate down for cost reasons.
Despite these critiques of the article, the paper itself is pretty good.