For a while now, IBM has had multiple and competing tools for managing AIX and Linux clusters for its supercomputer customers and yet another set of tools that were used for other HPC setups with a slightly more commercial bent to them. But Big Blue has now cleaned house, killing off its closed-source Cluster Systems Management …
Only just got used to CSM!
It is a mistake to assume that xCat is built from the ground up. It still uses underlying components that are currently used by CSM, including NIM (NIMoL for Linux) for system image deployment and RSCT for monitoring, and they all revolve around other well-known systems such as NFS, Kerberos, and rsh/ssh (to say nothing of the open source components).
It's true that the overall gloss on the top is new, but many of the bits under the covers are the same. It's interesting to also see that IBM Director still uses NIM for Power/AIX systems.
I must admit that I believe the switch from PSSP to CSM was a bit of a dog's dinner (I understand the architectural reasons for the change: PSSP was designed around the constraints of the AIX SP/2, which were too rigid for the Cluster/1600 offering). It took a couple of years for CSM to even approach the usability that PSSP had, and I fear the same will be true for the switch from CSM to xCat2.
The real problem is that often the people who write the code often do not work in the real world, and end up making assumptions about the shape of the systems. This means that, once you take into account the various networking, commercial support and security constraints placed on real-world systems by the likes of government agencies and financial organisations (the people most likely to deploy large-scale commercial clusters), the management tools as delivered out of the box are very often about as useful as a perforated condom.
Even in HPC environments, it is comparatively unusual to have the systems configured exactly as the vendors suggest. I'm currently working with a large Power6 HPC cluster, and the requirements for outward event reporting to an enterprise reporting system that is NOT Tivoli are causing more than a few problems, along with security, access and data control that IBM had no real incentive to architect solutions for. As a result, it is necessary to dig under the glossy covers of the management and deployment tools using whatever can be found to implement what is needed.
All I can hope for is the fact that xCat2 has come out of Alphaworks means that real system admins have had input to the requirements and may have spotted any potential problems, but I wonder how many 200 Power node clusters have been deployed anywhere.
I'm still not very happy about having to learn another clustering tool, though, and I've still got IBM Director to contend with in the future for non-HPC clusters.
Guess CSM customers will have to migrate
On one hand, I feel sorry for all the IBM CSM customers who now have to migrate to a new software stack. On the other hand, IBM is really doing them a favor, because there are so many better software choices out there; Sun's own HPC software stack (www.sun.com/software/products/hpcsoftware/) and Unicluster from Univa UD are two good options.
Quote, "It is a mistake to assume that xCat is built from the ground up."
xCAT 2 was/is built from the ground up. In 2007 the xCAT 1 and CSM teams merged. We defined a new framework based on our combined experience and then set forth to build it. xCAT 2 is the best of both worlds (CSM and xCAT 1), and is all new code.
Quote, "It still uses underlying components that are currently used by CSM, including NIM (NIMoL for Linux) for system image deployment and RSCT for monitoring..."
NIM is not used to provision Linux. Each OS has its own unique and native solution, e.g. Kickstart for RH, AutoYaST for SuSE, Windows (something) and ImageX for Windows, NIM for AIX, etc.
RSCT is not part of xCAT 2 and is not required. However, it can be used with xCAT 2 if desired. Many AIX shops do this.
Quote, "The real problem is that often the people who write the code often do not work in the real world, and end up making assumptions about the shape of the systems."
I have been designing and deploying some of IBM's largest HPC systems for 10 years.
Quote, "All I can hope for is the fact that xCat2 has come out of Alphaworks means that real system admins have had input to the requirements"
xCAT is an open project. All feedback comes directly to the developers. You can provide any input via the mailing list or the SourceForge site.
Quote, "I wonder how many 200 Power node clusters have been deployed anywhere."
The LANL Roadrunner system (the Top #1 system at 1.1 PF) has more than 6000 Power-based Cell blades that boot with OpenFirmware just like any other pSeries machine. The entire system is managed with xCAT 2.