Lately, everyone seems to have lined up to join the network failure party. In some cases, lax network security has been to blame. In others, upgrade issues coupled with fundamental design flaws have done the damage. An inability to cope with denial-of-service attacks by angry internet mobs has even resulted in disruptions to …
Good to see consideration of out-of-band management there: interesting use for a fax line. The emphasis on lab verification of changes is useful too. That being said, this sounds more like a wish list by someone who hasn't had to deal with either a) hard business realities or b) very complex network failures. I would hesitate to describe myself as any sort of network "guru", but I do build out and support data centres, offices, and WAN links of various sorts, including metro rings, and the biggest causes of long outages are usually poor design (usually too complex), poor software (Brocade gets a special mention here!), and carriers (no comment!): in that order. Doubling and tripling up redundancy can improve failure rates (although I should mention that most vendors won't load-balance links properly with IGPs or port channels unless the link count divides by 2), but that ain't necessarily so. My worst outage involved a stray OSPF default route screwing up a multi-homed site without out-of-band management: a single-homed site would have had higher reliability over the same calendar year.
Networks is hard.
The problem with specifying two of everything, or three of everything is that the duplicates or triplicates aren't really identical copies of the original. They won't have the same MAC addresses and they almost certainly won't have the same IP addresses as the "hot" or production systems. They probably won't have the same network config/routing as their peers and it's highly likely that some of them will have different firmware, too.
Because each component of a network has to be unique, testing a new network prior to roll-out, or even reliably testing a change in anything resembling the production environment, is very difficult indeed - I don't think I've ever seen anyone do it successfully, despite what they say or claim. The differences, even with an identically cloned sandpit, may be so large (without production network loads and without replicating *every* piece of kit in the production environment) that the testing adds very little value, is prone to false alarms, and merely doubles the cost and duration of every activity.
The best approach is to compartmentalise everything, so a fault in one segment doesn't have any effect outside its local domain. We know this strategy works - just look how successful it was for the Titanic. After that, make changes slowly - one piece at a time. And yes, there is no substitute for actually being there.
Although I Am VERY Happy...
...to sell you 40 or 100 or 1000 megs of network, we continue to sell dialup because it IS an important backup to your other systems.
That beer is not gonna buy itself.
Remember dial-up modems?
Of course, you do need a serial port to connect it to
My modem uses USB.
There's an appliance for that
Dial-up is enough for serial console access, which is fine to fix up a routing oops at a remote site - but for RDP you're much better off using cellular WWAN. A northbound VPN or reverse SSH tunnel gives you inbound access using a cheap supermarket SIM without having to pay extra for a public IP.
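A minimal sketch of that reverse-SSH approach (all hostnames, ports and usernames here are made up for illustration - the point is that the remote box dials *out*, so the supermarket SIM's carrier-grade NAT never matters):

```shell
# On the remote site's OOB box: open a persistent reverse tunnel so
# that port 2222 on a jump host you control forwards back to the
# box's own SSH daemon. -N means "no remote command, tunnel only".
ssh -N -R 2222:localhost:22 oob-tunnel@jump.example.com

# Back at base, hop through the jump host to reach the remote box:
#   (run this on jump.example.com)
ssh -p 2222 admin@localhost
```

In practice you'd wrap the first command in something like autossh or a systemd unit so the tunnel re-establishes itself when the cellular link drops.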
You could hook up a dial-up modem or WWAN dongle to a bridgehead system as a homebrew OOB gateway, but then the problem becomes building something more reliable than the infrastructure you're managing. Otherwise a purpose-built out-of-band management appliance with built-in dial-up or cell modem, serial console server and remote power control can be had for 300 or 400 quid.
Full disclosure - I work for Opengear, who make the aforementioned out-of-band management appliances.
It's not all about redundancy
As the first poster mentioned, complexity is the enemy of fault diagnosis and resolution as well as security.
If you double or triple everything, you may be spending disproportionately to the problem (a single fibre has around 99.9% uptime, but adding a second, diverse link will add £14,000 p.a. to the bill - is eliminating that 0.1% downtime risk worth £14k+?).
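To put rough numbers on that trade-off, a back-of-envelope sketch (the per-hour outage cost is a pure assumption you'd replace with your own figure):

```python
# Rough availability arithmetic for the single-fibre case above.
HOURS_PER_YEAR = 24 * 365  # 8760

single_link_uptime = 0.999                 # ~99.9% for one fibre
expected_downtime_h = (1 - single_link_uptime) * HOURS_PER_YEAR
print(f"Expected downtime: {expected_downtime_h:.2f} h/yr")  # ~8.76 h/yr

second_link_cost = 14_000                  # GBP p.a. for a diverse link
outage_cost_per_hour = 1_000               # GBP -- assumption, varies wildly

# The diverse link only pays its way if the downtime it prevents
# costs more than the link itself does (and that assumes the second
# link actually survives whatever took out the first one).
expected_outage_cost = expected_downtime_h * outage_cost_per_hour
print(f"Expected outage cost: £{expected_outage_cost:,.0f}/yr "
      f"vs £{second_link_cost:,} p.a. for the second link")
```

On those made-up numbers the second link doesn't pay for itself, which is the commenter's point: the answer depends entirely on what an hour of downtime actually costs you.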
If you have multiple ISPs or core networks, circuits, routers, firewalls, load balancers and servers, tracking down a complex fault and solving it takes longer (particularly if there are multiple suppliers blaming each other). There is a middle-ground between protecting yourself from common device and circuit failure in a proportionate manner, and making a system so complex that if it fails in a non-standard way, tracking and resolving it becomes next to impossible - and the chances of avoiding introducing a security risk are slim.
As you scale, you must simplify.
"So go ahead and unplug that network cable."
Back in the day, as admin of a Token-Ring network, I used to do that on occasion. Almost always when some eejit held forth on the superiorities of Ethernet. It was always worth it to watch the evangelist in question go white as a sheet and make strangled noises.
Back then, having a network that didn't shit itself (or at least a segment of itself) when some luser decided to move their own kit was considered something of a novelty by most...
That's a bit, err... *antique* isn't it?
Not that people here don't remember coax [thin] Ethernet cable with T-adapters and those *absolutely vital* terminators at each "end" of the network --- or the even earlier, thicker, stuff with those strange "bee-sting" connectors, but ... it's been a while, and not many people would visualise that now if you said "Ethernet".
So sure, go ahead: unplug that network cable :)
It isn't a coincidence that...
... "real" network equipment comes with two serial interfaces just for management, one of which is meant to be plugged into a modem set to auto-answer. In branch offices, multi-purposing the line to a fax is fairly natural, though that solution isn't the only one possible. Back in the days of BBSes there were a few that you had to call, let ring twice, hang up, then call again. This worked pretty well and the software was available. Another way to achieve this cheaply is ISDN. One basic rate interface "line" gets you two usable data channels (64kbit in yurp, 56kbit in backwardia) that can be bonded at need, and up to eight numbers. One for the fax and one for the modem is then easily done. Of course, ISDN didn't really catch on in Blighty, in part because it was priced as two POTS lines (plus more expensive equipment) instead of at slightly over the price of one POTS line, as elsewhere.
Then again, if that router is sitting in a closet somewhere then these tricks might not be an option at all and you might have to shell out for a fresh phone line that just sits there. This appears to be somewhat cheaper in some places than in others. Either way, compared to the price of the kit, the price of driving down there, and the price of an hour's worth of outage, the price of a phone line and a modem might be entirely justifiable.
"Either way, compared to the price of the kit, the price of driving down there, and the price of an hour's worth of outage, the price of a phone line and a modem might be entirely justifiable."
These sort of things are, of course, much easier to justify to bean-counters immediately *after* a prolonged outage, than when you're talking about a hypothetical outage...
Well, that can be arranged... by modem.
Anyhow, while it's not unreasonable to have to justify expenses, assessing the arguments requires domain knowledge that þe average olde tallyer of beanes just doesn't have. That in itself is a hidden source of mis-spending, and thus of costs. I say it would be interesting to find ways to fix that: being ignorance-based, it won't fix itself.