When should you bin that old mainframe? Infrastructure 101

It's very easy to forget that buying kit for your infrastructure is just the first step on a long, long road. It's also easy to forget that everyone keeps their infrastructure kit going for years longer than their accountants depreciate it: we've all got something in the comms room that's still clinging to life. And this is …

  1. Anonymous Coward
    Anonymous Coward

    SNMP - Ha-Ha

    Quote

    Make sure you monitor your kit. By “monitoring” I mean using a proper SNMP-based monitoring package that will alert you to the smallest issue. Although network kit is becoming increasingly commoditised, that doesn't mean it doesn't break. Most of the time things tend either to work or fail completely, but intermittent problems still happen.

    You are at the mercy of the device manufacturer here. I know of one pretty critical system in use at our sites whose SNMP MIB contains only one item: the CPU temperature. That's it, nothing else.

    Also, a number of subsystems that support SNMP need to be polled for status information. They don't send alerts.
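
    Since such devices have to be polled rather than sending traps, the check ends up as a scheduled poll-and-compare. A minimal sketch (all names hypothetical; the real fetch would be an SNMP GET against the device):

```python
# A minimal polling sketch (hypothetical names throughout): the real
# fetch would be an SNMP GET, e.g. via snmpget or an SNMP library;
# a canned reading here keeps the sketch self-contained.
def fetch_cpu_temp():
    return 47.5  # degrees C, stand-in for the device's one MIB item

def poll_once(threshold=70.0, fetch=fetch_cpu_temp):
    """Poll the device once; return an alert string, or None if all is well."""
    temp = fetch()
    if temp >= threshold:
        return "ALERT: CPU temperature %.1fC >= %.1fC" % (temp, threshold)
    return None

# A real monitor would call poll_once() on a schedule (cron or a sleep
# loop) and push any alert through whatever notifies the on-call team.
```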

    The difference between the various SNMP monitoring packages out there is huge. Some work and others are a real PITA to configure and operate.

    So, Mr Cartwright, I know that you have a lot of experience in this area, but nothing is as clear cut as you might like to make out. In many cases SNMP is the last thing that some makers put into their systems. We've had to make full SNMP support a mandatory condition for anyone bidding to get their kit into our plants. You would be amazed at the list of so-called highly reputable kit makers that don't have proper SNMP support in their devices. One even wanted to charge us $200,000 to put it in.

    Anon for obvious business reasons.

    1. Chris Miller

      Re: SNMP - Ha-Ha

      SNMP = Simply Not My Problem

    2. Roland6 Silver badge

      Re: SNMP - Ha-Ha

      SNMP just covers the basics. If you want real monitoring, which is really a necessity with distributed systems and virtualised infrastructure, then you need to invest in application instrumentation and measurement systems, which tend to be proprietary, even if they present some of their data in an SNMP-friendly form...

    3. Anonymous Coward
      Anonymous Coward

      Re: SNMP - Ha-Ha

      I've managed without MIBs, but it's a painful walk through working out each little detail.

      Also, don't forget log watching: syslog, plus perfmon & WMI if it's a Windows box.
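
      For the syslog side, most of the useful signal is already in the message's priority field. A sketch of the decoding (standard RFC 3164 arithmetic; the sample line is illustrative):

```python
import re

# RFC 3164-style syslog lines start with <N>, where N encodes
# facility = N // 8 and severity = N % 8.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]

def parse_pri(line):
    """Return (facility, severity_name) from a raw syslog line, or None."""
    m = re.match(r"<(\d{1,3})>", line)
    if not m:
        return None
    pri = int(m.group(1))
    return pri // 8, SEVERITIES[pri % 8]
```

      A log watcher can then alert on anything at "err" or worse instead of grepping for magic strings.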

    4. JerseyDaveC

      Re: SNMP - Ha-Ha

      Yup, agreed, which is why I went on to mention that some stuff can't be interrogated with SNMP (e.g. the switch fan issue I mentioned). SNMP's a pain in the bum in a lot of cases (and the first word, "Simple", is a terrible misnomer!)

    5. Stephen McLaughlin

      Re: SNMP - Ha-Ha

      Some years back, we were managing equipment with little or no MIBs, so I went through the process of getting our company registered with IANA, created a database structure for the private enterprise numbers (1.3.6.1.4.1.X), wrote the programs, and implemented the processes to send alerts to HP OpenView. Only to have management decide, a few months after this went into production, that SNMP had to be turned off because of security concerns. grrrrrrrr...
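
      The OID plumbing that commenter describes is mostly string assembly under the private enterprise subtree. A sketch (the PEN 99999 below is a placeholder, not a real IANA assignment):

```python
# The private enterprise subtree sits under 1.3.6.1.4.1; your
# IANA-assigned Private Enterprise Number (PEN) comes next.
# 99999 in the tests is a placeholder, not a real assignment.
ENTERPRISE_BASE = "1.3.6.1.4.1"

def enterprise_oid(pen, *subids):
    """Build an OID string under a given private enterprise number."""
    return ".".join([ENTERPRISE_BASE, str(pen)] + [str(s) for s in subids])

def is_enterprise_oid(oid, pen):
    """Check whether an OID falls under a given private enterprise number."""
    root = "%s.%d" % (ENTERPRISE_BASE, pen)
    return oid == root or oid.startswith(root + ".")
```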

  2. The Islander
    Boffin

    I think the headline ...

    ... does not reflect the content of the piece. In this day and age, I submit it is highly unlikely - but not impossible - that mainframes would fall foul of this checklist. If anything, they are typically to the fore in engineering terms and are very often the subject of extensive documentation.

    All that said, I fully agree with the underlying points made.

    1. Naselus

      Re: I think the headline ...

      It did say 'old mainframe', so I'd assume he's referring to the ancient god-computers from the mid-80s that some companies insist on keeping alive, rather than a modern HP supercomputer from 2010.

      1. Doctor Syntax Silver badge

        Re: I think the headline ...

        "the ancient god-computers from the mid-80s"

        Those would be the ones with the core applications that run the business that brings in the money that pays everyone's wages. The ones everyone born later than the mid-80s sneer at.

      2. Lusty

        Re: I think the headline ...

        Some companies have to keep older kit alive. Sometimes the old kit still does the job it was designed to do perfectly. Sometimes regulatory requirements mean that assuring new systems is time-consuming and difficult, and so isn't done on a whim just because some new shiny is available. I for one am quite happy for nuclear reactor safety systems to be left alone rather than replaced arbitrarily with DevOps-rich, JavaScript-heavy, change-happy nonsense which breaks when a developer throws his toys out, just because it's flavour of the month and someone with no experience of safety systems declares that old stuff needs replacing because it's old. If spares and support are available for longer than it would take to replace something, there is sometimes no good reason to replace it.

        1. Stoneshop

          Re: I think the headline ...

          Some companies have to keep older kit alive.

          We have such a case. It's five sets of two master/slave systems, essential in controlling safety-critical systems. These master/slave systems are not safety-critical themselves, but if both members of a set fail, the actual control systems will go into a safe state, which means stuff gets halted. And if stuff gets halted there will be ministerial questions.

          As the hardware these systems run on will be out of support at the end of this year, and the software for the replacement systems will not be ready until mid-2018 at the earliest, we have to keep these systems alive for two more years, with most of them having already done eight years running 24/7. Fortunately we have at least 20 similar systems, decommissioned over the past year and a half. They will be configured to be identical to the current sets, tested, and then five of them will be put in the computer rooms next to the existing sets, one spare per set. They'll be up and running with just the OS and monitoring software, so that we're not going to get caught with one of the active set failed, then trying to boot a spare and finding that it won't.

          Each of those spares will be set up so that it can take over as either member of its associated master/slave set, and all of this can be done remotely. The only manual operation necessary is moving a crossover cable from the failed system to the newly active one some time in the next 48 hours or so. Then one of the remaining systems will be taken from stock, tested, set up as the new spare for that location and moved there.

          I think we'll manage.

          1. Code For Broke

            Re: I think the headline ...

            How difficult was the decision to keep the spares running 24/7? I understand your logic, but I might have thought it better to not stretch your MTBF. I might have favored a monthly test cycle and then shut down of the spares.

            1. Disk0
              Boffin

              ...shut down of the spares.(?)

              Spares would generally not be operating under (full) load until pulled into deployment, they may even be idling altogether, so they're not likely to burn out as quickly as the system they are a spare for.

              Switching off an arguably essential system makes it harder to manage, and you never know if it is going to act as expected once (if) it is booted. There are many things that can go wrong even with a system that is powered down - what if a network cable comes unstuck, a port fails, some rust refuses to spin, the internal clock drifts or a configuration file doesn't load properly...? Now you're troubleshooting the very system that should be making your life easier. Shouldn't any system that is part of your critical infrastructure be available at all times?

              1. Roland6 Silver badge

                Re: ...shut down of the spares.(?)

                There are many things that can go wrong even with a system that is powered down

                Don't disagree; however, I've also encountered problems with systems that had been continuously up being power cycled, only not to come back up...

                I think in your circumstances having some boxes powered down is going to be useful, so that as your hot replacements get used, the cold replacements get powered up.

                1. Stoneshop

                  Re: ...shut down of the spares.(?)

                  Don't disagree, however, I've also encountered problems with systems that have been continuously up that have been power cycled,

                  Because of the vagaries of several of the software components, these systems need a full reboot when being promoted from spare to slave (if the active master conks out, the active slave will become master), but they don't need a power cycle.

            2. Stoneshop

              Re: I think the headline ...

              How difficult was the decision to keep the spares running 24/7?

              Not at all. Mind, there are just five in the datacentres set up as warm spare (OS running and monitored, but not otherwise active), with fifteen more on a shelf in central stock.

              And our rule is: if it's in a DC, then it's to be monitored. If it's to be monitored, it obviously needs to be booted up and running an OS. End of discussion.

              Powering up once a month will very likely kill at least one, and probably more over the next two years. Letting them run untouched until required will be the much better option, in our experience (the active master/slave systems don't have much of a CPU and I/O load either).

    2. Anonymous Coward
      Anonymous Coward

      Re: I think the headline ...

      We turned off our last pre-2000 VAX system last year (our last customer upgraded away from VAX and moved to a Linux server).

  3. chivo243 Silver badge

    In previous lives

    My office was right next door to the server room. We had a Dell PowerEdge 18xx and one morning, I could hear a drive making a bit more noise than usual. Now I'm way down the hall and to the left....

    The oldest device in the rack is an HP DL360 Gen8 I think; it might have its 6th birthday sometime this summer.

    There is a Macintosh LC475 sitting on our service counter as decoration/conversation starter. Nobody knows the passwords from 1999 ;-}

    1. Disk0
      Pint

      Re: In previous lives

      To get in, try starting the LC with the Shift key held down; this will boot the system with just the bare necessities.

  4. Sir Sham Cad

    Support contracts

    Very good points about support contracts. If we had a support contract for all of our user-level (edge) Cisco switches it would cost a lot more than the cost of replacing ones that go bang annually.

    The more important bits, however, do need 24/7/365/4 hour cover and ye gods is that a PITA every year.

    1. Anonymous Coward
      Anonymous Coward

      Re: Support contracts

      I've also had the situation where the support won't be paid for, because "it's just a firewall, why does a firewall need updates?"

      Me: The firewall is 6 years past its end of support.

      PHB: But it is still working and we never installed updates during the time it was supported. It is a firewall, it doesn't need updating.

      The same went for the servers running SUSE from 2000 and a Windows 2000 server...

  5. b3stbuddy

    Support

    I agree with about 50 percent of what's said on the contracts. Though some may not need 24/7, in a large datacenter a good crew on hand via a 24/7 contract is essential when you have the overlap of dozens of complex systems. Otherwise you will lose critical time trying to troubleshoot issues. You can't be an expert in everything. I don't see how having more eyes on a problem would be a bad thing.

  6. Robert Moore

    Standard admin password??????

    "be sure that nobody else can get hold of something they could use to deduce (say) your standard admin password"

    It is 2016. WTF are you doing having a standard admin password?

    If you have to have one, for dog's sake why are you storing it anywhere unencrypted?

  7. Pirate Dave Silver badge
    Pirate

    Change Management

    For my switches, I use some expect scripts to log in and download the current config to a text file. Then I check that file into RCS*. Then I back up the whole kit and caboodle to tape/removable. Not only do I get to see what's changed over time, but when it changed, and what it was before it was changed.

    *Ok, so maybe RCS isn't the best choice, but it's dead simple to use.
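
    The deciding-whether-to-check-in step of that workflow can be sketched without the device-specific expect part. A hedged sketch (file names illustrative; the fetch itself is whatever your expect script produces):

```python
import difflib
from pathlib import Path

def config_changed(new_text, stored_path):
    """Diff a freshly fetched config against the last stored copy.
    Returns unified-diff lines; an empty list means nothing changed,
    so there's nothing new to check in."""
    stored = Path(stored_path)
    old_text = stored.read_text() if stored.exists() else ""
    return list(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=str(stored), tofile="fetched",
    ))
```

    Checking in only on a non-empty diff keeps the version history to actual changes rather than one revision per nightly fetch.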

  8. Disk0
    Thumb Up

    Some good suggestions

    I particularly liked the point about the lowly office switch possibly deserving its own support scheme. All kinds of minor equipment can be a single point of failure. Another thing: no matter how luxurious an outside support contract may be, all of it still depends on someone with the right knowledge and the right equipment/parts being available at the precise moment you need them, a dream that doesn't always come true.

  9. Mike 137 Silver badge

    time dilation?

    "24/7/365 support". Talk about over-working the team - 24 hours a day, seven days a week, 365 weeks a ... oh, hang on! Something's not quite right here. In the real support world, you provide either 24/365 or 24/7/52.
