A recent update to the Point-Of-Sales (POS) software my organization uses also had the potential for some very serious disruption. The software in question is fairly decent stuff as far as POS software goes. It does the job and the features cover most of what we might want it to do. This software however has traditionally had a …
You're running your POS systems on Windows?
God help us all. I'm going back to cash.
On a related note: why did my server die? A disk failed in a RAID1 system. Now I'm getting input/output errors when I try to do anything, including "ls /proc", although "echo /proc/*" works. It's only a web server and everything's cached, so I'm leaving it up until I can work out what's going on. Any ideas?
So you commented to:
1) criticize this guy for running a system on Windows, even though it's worked for him and he's been able to identify why a major failure happened, and
2) admit that the unix-variant you're running currently has a major failure that you can't identify.
POS on Windows
My local Walmart shit-hole in South Bristol has 10 self-service check-outs, of which only 5 or 6 ever seem to be up at a time. The others? Yes, you guessed: BSODs or admin password entry dialog boxes on display.
It would be laughable if it wasn't so fucking annoying, as virtually every other bastard wants to use them and not the (still, for a while) human (just about human, I mean) proper tills. While I'm on the subject, why the hell don't the plebs walking around like officials from the Nazi party do something about the morons taking trolleys through self-serve?
I like Lidl; at least their veg and fruit stays fresher longer than the crap from As-mart (and there's a really cute Polish girl in my nearest store ;)
I would guess that
I would guess that your shell was memory resident, but the ls executable wasn't. Then if the disk is hosed, ls won't work, but echo will, since it is a shell builtin.
In other words, this sounds like the 'good' disks in your array are bad as well, or your controller is dead or confused, or the driver is confused. Or any combination of the above...
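A rough way to check the builtin-versus-executable distinction, assuming a POSIX `sh` is on the box: `command -v` prints a bare name for something the shell resolves itself and a filesystem path for a program it has to load from disk. The helper name `is_shell_builtin` is made up for this sketch.

```python
import shlex
import subprocess


def is_shell_builtin(name: str) -> bool:
    """Return True if POSIX sh resolves `name` itself rather than via a path on disk."""
    # `command -v` prints a bare word for builtins (and keywords or functions),
    # and a path containing "/" for external programs. Nothing is printed if
    # the name is unknown.
    result = subprocess.run(
        ["sh", "-c", f"command -v {shlex.quote(name)}"],
        capture_output=True,
        text=True,
        check=False,
    )
    resolved = result.stdout.strip()
    return bool(resolved) and "/" not in resolved


if __name__ == "__main__":
    for cmd in ("echo", "ls"):
        kind = "resolved by the shell" if is_shell_builtin(cmd) else "external (or not found)"
        print(f"{cmd}: {kind}")
```

On a healthy machine this should report `echo` as resolved by the shell and `ls` as external, which matches the symptom: the glob in `echo /proc/*` is expanded by the already-loaded shell, while `ls` has to be read from the failing disk.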
Good luck, I'm afraid you'll need it.
POS vs POS
POS is windows, or POS on windows?
Both work for me.
Though I do wish my 2008 boxes would crash a bit more to back up my prejudices...
I am new to this forum and trying to figure out this..
There's no connection between the two statements.
My box is running a free linux variant. It's only a webserver. If it goes down, there are plenty more. Even though the disk controller was fucked, the box was still up and did its job quite happily. Outside of peak hours, I rebooted the box at which point the issue was obvious. I couldn't diagnose on the spot because, as I said, the box couldn't access the disk (logs, tools, etc).
On the other hand, he's running Windows on a POS!
What "hardware resources"?
"we simply could not get it to consume more than 25 per cent of the hardware resources provided."
What does this mean? What hardware resources? CPU? Memory? Disk space? Disk I/O capacity? Network bandwidth?
It would require a really well tuned application to use all available CPU and all available I/O at the same time. Usually one or the other dominates...
The system would never really get above 30% CPU. It wasn't hitting the disk much, and didn't use the network for more than about 2-3%. RAM was about 40% in use. RAM wasn't all that active, so it wasn't a "very small chunk of RAM getting hit so hard that all the RAM bandwidth was being nommed" issue.
I poked at it on and off for four years. I never did figure out what I could possibly change that would make it actually consume more resources. There didn't appear to /be/ any form of bottleneck. Just a stubborn refusal to use what was provided it.
Truly and honestly the single most bizarre application I have ever had the opportunity to work with. With the sole exception of this one application, I have never met an app I couldn’t identify a bottleneck on.
Frankly, I can't see the 'problem' of 30%
If a server application never uses more than 30% of resources, and there doesn't seem to be a problem with the system (slow response), I'd be happy.
It's actually a sign of good programming that the SW doesn't use or reserve more resources than it needs.
My guess is that the company that made the SW has changed tools, their head programmer or they goofed with the SW.
(Delivered a set of 'Debug' versions of the executables, got the optimisation wrong during compile or something)
As with networks, I also like to see low utilisation on servers. It means there's lots of spare capacity to handle 'weird' situations.
Under the old version that used 30% resources, a sample report would take 6 hours.
Under the new version that uses over 85% of the resources, the sample report takes 5 minutes.
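To put those two figures side by side, here is the arithmetic as a quick sketch. The only inputs are the numbers quoted above; "over 85%" is treated as a lower bound, and the variable names are just for illustration.

```python
old_runtime_min = 6 * 60  # 6 hours for the sample report on the old version
new_runtime_min = 5       # 5 minutes on the new version
old_util = 0.30           # old version: about 30% of resources
new_util = 0.85           # new version: "over 85%", so a lower bound

speedup = old_runtime_min / new_runtime_min
util_ratio = new_util / old_util

print(f"speedup: {speedup:.0f}x")                  # speedup: 72x
print(f"utilisation increase: {util_ratio:.1f}x")  # utilisation increase: 2.8x
```

The report got roughly 72 times faster on less than 3 times the utilisation, so the extra hardware use alone does not account for the gain. Whatever changed in the patch presumably did more than just lift a cap.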
My question is this: why was the solitary VM allowed to dominate the entire VM server and cause the other VMs to crawl in the first place? A virtualization environment has plenty of controls to prevent a VM from hogging resources or infringing on the performance of neighboring VMs. First off, the CPU, disk, and likely network "shares" (VMware) should be set higher, or at the very least given higher priority, than those of the report/test servers. Likely, the test server should be set to "low." Also, CPU utilization can be capped, which is also recommended, to prevent a single VM from stealing all the MHz allocation. These, I believe, can be edited on the fly too, so a VM that red-lines the server can be adjusted accordingly so the system can return to normal until the process finishes.
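For anyone unsure how shares and caps interact, here is a simplified model of share-weighted allocation under contention. It is not VMware's actual scheduler; the `VM` class, the `allocate` function, the VM names, and the share and MHz values are all assumptions made up for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VM:
    name: str
    shares: int                    # relative weight when the host is contended
    demand: float                  # MHz the guest would use if unconstrained
    limit: Optional[float] = None  # hard cap in MHz, or None for no cap


def allocate(capacity: float, vms: List[VM]) -> Dict[str, float]:
    """Split `capacity` MHz by shares, never exceeding a VM's demand or cap."""
    ceiling = {
        vm.name: vm.demand if vm.limit is None else min(vm.demand, vm.limit)
        for vm in vms
    }
    alloc = {vm.name: 0.0 for vm in vms}
    active = [vm for vm in vms if ceiling[vm.name] > 0]
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(vm.shares for vm in active)
        handed_out = 0.0
        still_hungry = []
        for vm in active:
            offer = remaining * vm.shares / total_shares
            need = ceiling[vm.name] - alloc[vm.name]
            grant = min(offer, need)
            alloc[vm.name] += grant
            handed_out += grant
            if need > offer:
                still_hungry.append(vm)
        remaining -= handed_out
        if len(still_hungry) == len(active):
            break  # nobody was satisfied, so nothing is left to redistribute
        active = still_hungry
    return alloc


# Two guests that would each happily eat the whole 10,000 MHz host.
uncapped = allocate(10_000, [VM("app", 2000, 10_000), VM("test", 500, 10_000)])
capped = allocate(10_000, [VM("app", 2000, 10_000), VM("test", 500, 10_000, limit=1_000)])
print(uncapped)
print(capped)
```

In this model the low-share guest still gets a fifth of the host when both are busy, and the cap is what actually pins it down; the headroom it gives up is handed back to the other guest.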
The VM was allowed to dominate the entire server simply because...it was a test server. The point of the testing is to see if/how patches and updates change the behaviour profile of an application. If the behaviour profile of an application changes, then the entire virtual instance changes and I then go back and recalculate my load balancing.
Virtualisation is a tool in a systems administrator’s arsenal, no different from any other. While resource constraints are a fine thing in a production environment, they make absolutely no sense to me in a testbed environment. I view my testbed virtual servers similarly to how I view calibrating my various test equipment: it is there to provide you with baselines against which you can measure issues in production.
Given the radical nature of the performance changes delivered by this patch, I would say that this application server is quite simply no longer a candidate for virtualisation. Put another way: my calibration tests determined that the tool I was using is no longer valid for the environment in which it must operate. Indeed, this incident makes me grateful that I /don’t/ run my testbed systems with resource constraints. With resource constraints in place I may never have caught this performance difference. Had I not caught it, we would not be able to take advantage of the vastly superior report generation times this patch enables.
More to the point…it allows us to simply remove from service several instances of Windows Server we had been using to run “report servers” in order to compensate for the slow report generation features of this application. This frees up licences that I can use elsewhere for other projects. All in all, win/win/win largely because I take the time to reprofile my applications with every patch by running them on an “unlocked” server.
Based upon the performance delta, I have already begun the process of sourcing equipment to properly physicalise this server. This patch will not be applied to the production environment until that production environment has been fully physicalised.
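The reprofiling idea in this comment, comparing a post-patch profile against a known baseline, might look something like the sketch below. The function name `changed_metrics`, the tolerance, and the post-patch RAM and network values are hypothetical placeholders; only the baseline figures and the jump to roughly 85% come from this thread.

```python
from typing import Dict


def changed_metrics(
    baseline: Dict[str, float],
    patched: Dict[str, float],
    tolerance: float = 0.10,
) -> Dict[str, float]:
    """Return the metrics whose utilisation moved by more than `tolerance`."""
    changes = {}
    for metric, before in baseline.items():
        after = patched.get(metric)
        if after is None:
            continue  # metric not measured after the patch
        delta = after - before
        if abs(delta) > tolerance:
            changes[metric] = delta
    return changes


# Baseline: figures quoted for the old version (as fractions of capacity).
baseline = {"cpu": 0.30, "ram": 0.40, "network": 0.03}
# Post-patch: cpu reflects the "over 85%" remark; ram and network are
# hypothetical placeholders, left unchanged for the sake of the example.
patched = {"cpu": 0.85, "ram": 0.40, "network": 0.03}

changes = changed_metrics(baseline, patched)
print(changes)
```

Anything this flags is a cue to go back and redo the load balancing sums before the patch gets near production.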
Oh dear, oh dear...
"I ran the system through what tests I could think of ..."
Do we have a mug shot of this guy so I can be absolutely sure he gets nowhere near my systems.
Are you sure that this story doesn't say more about the pitfalls of virtualisation? In particular in virtualising production and test platforms onto the same physical server?
Production and test systems were not on the same server. The test VM was on a test server with other test VMs. When the new patch finally "unlocked" the performance of the POS software, it flattened the test VM, on the test virtual server. The side effect was one of also flattening the various other test VMs on that same server.
No production systems were harmed during the testing of this patch.
While putting a test VM alongside a production VM is not necessarily bad, consideration has to be given to the roles these environments serve. Each one will have its own objectives and risks, and that doesn't come across as being considered here, with no disrespect intended to the author. By definition a test environment is likely one egg you want to keep out of the basket, but we don't know the specifics here and shouldn't assume. If the software vendor didn't flag the changes in this major version then that's poor practice and words should be had. And it sounds like they're missing a trick around service offerings for customers such as the author. An interesting article, cheers.
Give it a rest
The way some of you go on you'd think you had never had an issue on a test environment, or had anything go wrong on a project.
I've got news for you: every company has had downtime due to admins thinking they are infallible, and that nothing in their environment could cause the issues.
At least Trevor is sticking his name to some of these stories and hopefully someone somewhere might learn something from them.
To the people going "Oh god why is this guy in charge of an IT system", how about you read the article?
It was a *Test* VM on a *Test* Server - nowhere NEAR the production systems!
Quote from article:
"Every other VM on the TESTBED server turned into molasses and phone calls started coming in from a half dozen different people demanding to know what had just happened to their TEST servers"
Anonymously - because I am describing a live system - I would like to add that too much testing is as bad as too little. One system my company develops for a third party undergoes an extensive level of testing, to the point where there is insufficient return from the testing to justify it. The Feds that audited the system stated (off the record) that the level of testing was paranoid - I just sighed.
There comes a point where the number of bugs discovered is not worth the effort put in. The law of diminishing returns.
Much of the testing and documentation has no purpose other than to placate an audit by the feared audit-nazis. What's worse is, it is not the auditors that demand this level of testing (as demonstrated by the off-the-record comments by the Feds), it is the test-zealots.
All systems should have some level of testing. But god help us from all the testing zealots who seem to think that if it is possible to test for something then you should.
It is oft-stated that Health & Safety are the arch-enemy of progress. I do not actually believe this - but there is a very strong case against the test-zealots.
You know who you are!
Let's hope you will never work on air transportation or nuclear plant control software...
Sometimes, there is no such thing as too many tests.
"Let's hope you will never work on air transportation or nuclear plant control software..."
"Sometimes there is no such thing as too many tests."
I have worked on the development and test teams of many avionics systems. And adequate levels of testing and documentation are applied. If you think that extra software testing should be done on some of these systems then you are just naive. I like to think I know what I am talking about when it comes to testing.
I think I can safely put you in the category of "test-zealot".
to those saying why windows
Just want to tell you that the checkouts in Tesco where you have the operator do the work are mainly running on Windows 95, and the stores that have self scan are only just using XP.
Another place I know of is still using Windows 2000 on a till system that was put in this year.
It isn't a killer, as this was a test server and the other testers should just man up and wait till the servers are running again. But the burning question on my mind is: did you not have resource caps on the test VM to stop any one VM hogging all the host's resources?
There are no resource caps on the test server because the test server is used to see what kind of resources an app consumes. I /want/ them fighting for resource contention, because it helps me profile any changes to the application. :D
Therefore, it would have been most helpful in the article to have written:
"I had willfully removed any resource caps and priority shares on my virtual machines to better see the resource demands of my VMs. It was this that caused all the VMs on the testbed server to flatline, not an inherent flaw in Virtualization Technology. I was asking for it."
Where was "a flaw in virtualisation technology" ever remotely blamed? Also: how exactly do you feel you have the right to tell someone else they "should" have resource caps in place on testbed servers? There are dozens of reasons not to and only a few very flimsy reasons why you might want to.
I don’t see how properly testing an application – in a testbed environment – in order to do things like application resource profiling (among others) is such an issue. If you feel this article is an attack on virtualisation as a technology, then I would like to suggest that you are more than a little overly sensitive about the topic. The article was about patch testing and conveying the concept that “just because something has always behaved in X fashion does not mean that it will continue to do so forever.”
Nothing in the article remotely talked about “a flaw in virtualisation technology.” Furthermore I most certainly /was/ "asking for it." That is implied by the concept of a TESTBED server. The purpose of a testbed server is to "run the thing and see if/how it blows up." I've never met a systems administrator who didn't equate "testbed server" with "asking for it to explode."
So some new developer found the sleep(0.01) code they put in a decade ago when they found out that the new servers were running their code too fast and they got race conditions ;)
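Joke or not, a stray sleep would produce exactly that "no visible bottleneck" profile. A toy sketch, with the `process` function and its 10 ms pause invented purely for illustration and no claim that the real application did anything like this:

```python
import time


def process(items, pause=0.01):
    """Do trivial work per item, then sleep, the way a leftover throttle would."""
    total = 0
    for item in items:
        total += item * item
        time.sleep(pause)  # the hypothetical leftover throttle
    return total


wall_start = time.perf_counter()
cpu_start = time.process_time()
process(range(50))
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start  # process_time excludes time spent asleep

cpu_fraction = cpu / wall
print(f"wall {wall:.2f}s, cpu {cpu:.4f}s, cpu share {cpu_fraction:.1%}")
```

The loop takes at least half a second of wall time while burning almost no CPU, disk, or network, so every monitoring graph stays low and nothing looks saturated.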