“Ring, ring...” “IT support. How can I help you?” “The system’s really slow. It’s taking about a minute to save stuff. Normally it’s only a few seconds.” “How long has this been going on?” “About 20 minutes, and the backlog is building up. It slowed down yesterday too, but then seemed to right itself after about 10 minutes so …
The simple solution
The basic problem is that IT staff tend to only monitor stuff inside the datacentre - not the user experience. That might have been sufficient 20 years ago, but nowadays the only reasonable approach is to start with the users' experience and to work backwards from that.
The reason we have that situation is that the tools that come with most IT systems are really only designed for monitoring servers, looking at a small number of parameters: CPU, memory, disk I/O and network traffic. Any dashboard is usually simply plopped on top of these metrics.
We tend to value what we measure, rather than measuring what we value.
In fact it's quite easy to do the right thing. Even better, it can be done for free. Using packages such as AutoIt3 for Windows or Tcl/Expect for Linux/Unix, it's quite feasible to measure the response times that users experience, or the time that the queries they execute take to run. We've been doing that for a long, long time and it usually provides exactly the information needed, quickly and accurately.
With the proper analysis, it's possible to see quite small deviations from normal response times. Generally users are prepared to put up with quite a lot of pain before they'll pick up the phone to the Hell Desk and report anything, so with these techniques it's perfectly possible and practical to know before they report it that there's a problem looming.
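As a rough illustration of the idea above, here is a minimal Python sketch that times one user-style action and checks it against a history of normal response times. Python stands in for the AutoIt3 or Tcl/Expect scripts mentioned; the function names, the sample baseline and the "three standard deviations" threshold are all assumptions made for the example, not anything the original poster describes.

```python
import statistics
import time
from typing import Callable, Sequence


def time_transaction(action: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one scripted user action."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def is_abnormal(sample: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """Flag a sample more than k standard deviations above the baseline mean.

    The threshold k is an illustrative choice; tune it to your own data.
    """
    mean = statistics.fmean(baseline)
    spread = statistics.stdev(baseline)  # needs at least two baseline values
    return sample > mean + k * spread


if __name__ == "__main__":
    # Hypothetical history of "save" times in seconds.
    normal_saves = [2.0, 2.1, 1.9, 2.2, 1.8]
    elapsed = time_transaction(lambda: time.sleep(0.01))  # stand-in action
    print(f"took {elapsed:.3f}s, abnormal: {is_abnormal(elapsed, normal_saves)}")
```

Run on a schedule from a desktop-like machine, something of this shape would notice a save creeping from two seconds towards a minute well before anyone picks up the phone.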
And so on, through the whole infrastructure
This stuff isn't "that" hard. Collect the data as below, push into "something" (Splunk, Nagios, a spreadsheet, a piece of paper, etc.), and have "something else" (email, IM, pager, dullard co-worker, etc.) let you know when it looks "weird" (abnormal, outside SLA, missing, etc.).
1. Pick a common *business* process which touches many points in the infrastructure.
2. Figure out how far through the process the initial correlation data (transaction ID, etc.) will be carried.
3. Time entry and exit of the correlation data from outside the system (any scripting language can be used for this, or a stopwatch for the technically incompetent).
4. Repeat through the business process.
5. Pick another business process that touches (generally) different infrastructure.
6. Goto 2.
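The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `HopTimer`, the stage names and the transaction ID are hypothetical, and how you actually observe the correlation data entering and leaving a system is left to whatever script or log scraper you already have.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class HopTimer:
    """Records when a correlation ID enters and leaves each named stage."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entered: Dict[Tuple[str, str], float] = {}
        self.durations: List[Tuple[str, str, float]] = []

    def enter(self, txn_id: str, stage: str) -> None:
        """Note the time the correlation data was seen going in."""
        self._entered[(txn_id, stage)] = self._clock()

    def leave(self, txn_id: str, stage: str) -> Optional[float]:
        """Note the exit and return seconds spent, or None if no entry was seen."""
        started = self._entered.pop((txn_id, stage), None)
        if started is None:
            return None
        elapsed = self._clock() - started
        self.durations.append((txn_id, stage, elapsed))
        return elapsed

    def missing(self) -> List[Tuple[str, str]]:
        """Entries that never produced an exit (the 'missing' kind of weird)."""
        return sorted(self._entered)


if __name__ == "__main__":
    timer = HopTimer()
    timer.enter("TXN-0001", "web-frontend")   # hypothetical stage name
    time.sleep(0.01)
    print(timer.leave("TXN-0001", "web-frontend"))
    print(timer.missing())
```

The `durations` list is what gets pushed into "something" (Splunk, a spreadsheet, and so on); `missing()` covers the case where a transaction went in and never came out.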
when the system is running slowly
> The PC on the desk is running a web browser that connects .. So when the system is running slowly, where do you begin to investigate?
I don't need to investigate; I know exactly where the problem lies. It's because network traffic has to go through a NAT server, a proxy server and a firewall before getting to the desktop, where it then gets filtered by the anti-virus application. All in all this slows usability down by about 50%, and the local webapps are next to useless because of this.
Budgets and metrics
The problem is often in IT management. Most existing helpdesks have their workload and performance measured through a simple issue-tracking application. Team performance is evaluated based on how many calls per hour they can resolve. In that environment, there is no incentive for helpdesk operatives to undertake preventative maintenance - in fact the incentive is to wait until things blow up, then keep themselves busy for hours resolving the crisis.
It's even worse for outsourced services. Your boss's boss has signed a contract with another company's boss, to provide a hosted (cloud / outsourced / managed / insert buzzword) service. When something goes wrong, you (the mere peon) call the remote helpdesk (also mere peons) and complain that it's "a bit slow". Their response is typically "oh it's just like that", or "it's always slow on Tuesdays"; when in truth they've massively underprovisioned their system. However you have zero leverage over them and even your boss doesn't take much interest. Even if he did, you're locked into a contract for years and all your data is on their system. Result: lost productivity, wasted time and money.
Budget constraints and performance metrics are part of the problem, yes. However, an even larger elephant in the room is that of compartmentalization. Often, you'll have a group of IT guys who specialize in one thing. You ask the "Network Infrastructure" guy to fix a syslog error on the Win 2003 server and he'll likely not know how. Ask the server guy to fix a syntax error on a JSP page and he won't know where to start. Diverse skillsets in even one person will help diagnose a "slowness" issue more than having 3 whole teams on the matter. Why? Because, chances are, they won't be able to see outside their scope. They may not even know you're running that Tomcat webapp server in a VM that was only given a single vCPU without any MHz reservations, but happens to coexist on the same physical machine as the reporting server...
You get things like the user swore (the fucking system is shit slow)
Most 3rd parties can hang up the call there and then (abuse)
In fact if they DO NOT HANG UP when you swear they will lose their job!
So the entire exercise is to see who has the least humanity
T H E M O R E R O B O T I C Y O U R S T A F F T H E B E T T E R
Say you're the support staff and you say something like "Sounds like a shit storm, I will send an engineer pronto" you might as well go home after that call, 'cus you just lost your job; you said the word SHIT in case you missed what I was typing
No, the New World Order is we are all French; We must wear and speak what we are told.
Only the "Real" French may tell what these are.
You are Brandy
We are Cognac
It's not always cheap to do it, but it is extremely effective. For Messaging we work with this crazy-good outfit from Germany called HyperSoft - every 10 minutes or so, it will - for example - go out and open an Outlook inbox and send a message - the response and route times are captured, graphed, etc. It can't catch everything - not without having a monitored inbox in every Exchange storage group - but for the type of nebulous/hard-to-determine-origin problems discussed here it can be very effective for throwing the flag up that something is going on. It's either that or waiting for the users to call and complain, and in that case you have no historical baseline to reference.
For the record - Buzzword :) - we do this for some of our outsourced clients. Problem is that we don't always have the budget to do it - when we sell stuff the client has the option of funding this type of tooling or not funding it depending on what type of SLA reporting they ask for.
re: Synthetic Transactions
"OmniContext™ metering toolbox measures and reports to manager or technical team all aspects of companies’ workflow such as speed, quality and automation of workflow processing helping to track performance issues regardless of workflow complexity or number of tasks and people involved in collaborative projects"
Sounds like nothing to do with network issues. Something HR would find more useful in monitoring employee performance. Reminds me of when I worked on the helpdesk of an ISP. Management brought in some helpdesk software to make things more efficient. All it consisted of was a bunch of forms and tick boxes we had to fill in as we took the call. Since filling in the forms took all our focus we had even less attention to donate to the caller. Screen update was so slow we ended up opening up a bunch of "tickets" in one go, filling them in and then posting them all at once. That way management was satisfied and we could get on with actually doing our job.
If you want to diagnose network lag, you could fire up Nmap and Wireshark and see what unnecessary traffic is running on your network.
Network is just one potential cause
...and more usually, it's something at the app layer that is the cause of the slow response. Wireshark is *great*, don't get me wrong, but it's more - IMHO - for digging into a network issue once it's found vs. finding a slow/no response issue of undetermined origin (which might not be network) to begin with.
Also, the Hypersoft product I was referencing was OmniAnalyser (http://www.hypersoft.com/oa_services.htm) - I'm not familiar with OmniContext although I think they refer to it as a Business or maybe Organizational Intelligence tool, which is something completely different than what we use it for (mainly "Service" SLAs).
When I'm in the room with the client to discuss this type of reporting, it's sometimes referred to as end user experience monitoring. Not to beat a dead horse here, but synthetic transaction monitoring vs. log scanning, process monitoring, or ping/heartbeat monitoring (i.e. traditional "server" monitoring) *is* a completely different beast. Performance degradation caused by, for example, problems with the SAN on the back-end is not something that traditional monitoring - in my experience at least - will tend to catch. The Synthetic Transaction monitoring will tend to catch them.
Also, for the record, Hypersoft isn't the only vendor that does this type of stuff - we have also scripted similar functionality on our own before with certain platforms.
An even simpler solution
Of course, they should have suggested that she turned the machine off and then on again
Not that expensive or difficult anymore
I have seen many extremely complex and costly Service Assurance or wider OSS implementations as they call it in the Telco industry... sometimes totally failing the main "Assurance" point.
We have since built, through experience, an approach and a ready-to-roll-out methodology and toolset (open source based, but really integrated, slashing the TCO many times over) to finally get to a point where user-level and platform monitoring can anticipate those dreaded user calls (the calls, not the users, of course). If there's a call, as said above, it is past the point of annoyance and it is already affecting the business... it means it is way too late and burning. For web applications at least, even legacy ones, there is no excuse not to have at least something in place to learn from incidents and, over time, know exactly where to look.
If you put a bit of pragmatic IT Service Management methodology on top, IT engineers finally have a chance to stop scratching their heads trying to figure out when those seconds of response time started to become an issue.
Rube Goldberg Was A Piker
The fundamental problem is simple: most implementations are stitched-together monsters made up of parts that should have been left at rest, and mainly not intended for their current purpose.
There is no getting around that, and many "solutions" (hung on like a bag on a kludge) only make things worse. I have seen multiple instances of "monitoring" schemes that turned out to be the performance issues they were meant to prevent. Some of them even create fake transactions that then produce errors in processing, spitting out messages that then need to be ignored. Of course, once you start crying wolf, no one notices when the pack really is at the lambs.
(Sometimes, programs abused by these "tools" will be hard-coded to ignore messages for certain user IDs. The IDs are then issued to real users a couple of months or years down the road. That's real fun to figure out.)
So long as we continue to build serious systems out of Tinker Toys (like IP and Web Servers) we will continue to create shambles that need constant patching and propping up; and sometimes the structures will just implode or burst into flame.
The fundamental problem
Is that management always want to put in short-term fixes for long-term problems.
There's never enough time or budget to do it right, but there's always time and budget to do it again.
Just wasted 5 minutes reading this obvious crap
There is usually only one reason for this situation: the IT dept not being able to do their jobs properly. In my experience this is usually caused by management "managing" and making decisions in an industry they know nothing about, and/or IT staff simply not skilled enough to do their jobs and ignorant of the fact that there is anything wrong.
all too true
These types of events happen more often than not. This is all down to user mentality, and in many cases it is the perception that they don't want to bother the IT people with a self-righting problem. Sadly, those types of problems are merely tremors towards something bigger, and it's only when it goes bang or breaks that IT hears about things. Not all IT kit is created equal: big-tin IBM mainframe-type environments are less likely to experience such issues because the monitoring is more hands-on and the internal hardware monitoring is levels beyond desktop PCs. Indeed, rack servers are still catching up on many levels in the x86 market. Not long ago I noticed a server that had this burnt solder smell to it, and suspected something was wrong, yet all the management/server diagnostics/logs showed nothing; a few days later it died. In that situation the server was working fine and the logs etc. showed nothing, but at some level it was starting to show signs of going wrong, just nothing that could justify action at that stage beyond being mindful and keeping an eye out. Now, had it had flames or smoke coming out of it then, no matter what the diags/logs say and despite it running fine, it would be pulled, simple as. But having a smell that I'll admit not all could smell, though another was able to discern it as well *yay non-smokers*, was no justification for direct action. And this is the crux in many cases: whilst the signs are there, are they documentable enough to justify action? In many cases, including the one you outline, the user rings up - system going slow - then after logging lots of details, oh, it seems to be fine now, case closed. That's it in a nutshell on nearly everything: if it's not an issue, it's closed off.
Once closed it's just a statistic, and nothing of use beyond "it was running slowly at this date/time". That could lead somebody to check what changed on the server back then, and delay discovering that the network card was failing due to a heat issue in its placement that only became apparent under the extra load from the yearly accounting runs. Small details like that get swallowed up when you look at it from the IT perspective and get sidetracked by some date in the past. That's what happens when issues are closed off quickly because the problem fixed itself. There's also the question of how you map business knowledge, like periodic increases in workload, into the management software that monitors hardware today. It's hardware/software aware, not business aware. The fact that the accounts department has grown from 10 people to 15 for a two-week period is something most servers are never told, funnily enough. Sure, logins get added, but that tells you nothing about workloads.
Though it gets down to the real issue of most people being prevented from being proactive, and those that are end up hitting a wall of others who don't see an issue now: "when it happens, then we can look at it". Companies, as they work at the worker level, end up preventing their staff from proactively addressing problems, and it gets down to the fact that they don't have a charge code for an issue that doesn't exist yet. As for Quality Assurance: it's such an open aspect of IT that it boils down to rebranding testing and calling that QA, and that's it, sadly. As for the users, they just want something that works, and as far as they're concerned you can call it and do what you like as long as it works. How do you firstly define what QA is needed for a user who outlines their requirements like that, and then does the user need to outline their requirements beyond that? This then gets down to "we tested it, you approved/tested it" and a sign-off is agreed; beyond that it's bugs, which we know will happen as nothing is perfect. But does your accounts department budget and account for mistakes? And if so, if software costs 1m to build, would you seriously see them budget 3m for bugs/oss issues? Probably not, but what is the reality of a hardware/software project's cost to deploy versus what it costs down the line overall once it's a part of the business? It really gets down to listening to staff and paying them to speak their minds safely, without a P45 for going and speaking the truth sometimes.
Do you think you could manage smaller paragraphs next time please ?
Fixing problems that haven't happened yet
> I noticed a server that had this burnt solder smell to it
This is one of my continual bug-bears. What is the point of having tools (or noses) that detect *potential* problems, when all the business processes are intended to be reactive? In theory it's possible to raise tickets on predictions, but the priority is so low that they never, ever get addressed.
I am frequently able to detect and report bad things that WILL happen in the future. Whether it's disks filling up, batch jobs trending upwards over time, email response times slowing down or a multitude of other possibilities. However for most of these the support teams express no interest whatsoever. The reason is that to nurse a non-urgent change through the system takes a huge amount of time and effort, whereas an emergency change after a crash gets approved just like that <clicks fingers>.
So from an "efficiency" point of view i.e. doing the least amount of work, it's far better for the techies to wait for a failure (after all, it might not happen) and then look like a superstar by fixing it, rather than doing preventative work that takes time, but never gets noticed.
With outsourced systems it's even worse. Approach the support team with a "did you know disk X is about to fill up?" question and the best you can expect is that they'll ask you for your cost code, so they can investigate _your_ problem without paying for it themselves. We've pretty much given up trying to help these guys and now we just let their systems crash and burn.
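For what it's worth, the "disk X is about to fill up" kind of prediction needs very little machinery. Below is a minimal Python sketch that fits a straight line through daily usage samples and extrapolates to the day the disk is full. The function name and the sample figures are made up for illustration, and a straight-line fit is only one assumption about how usage grows; real usage is often lumpier.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    samples are (day, used) pairs; a least-squares line is fitted through
    them. Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical readings: (day number, GB used) on a 100 GB volume.
    readings = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(days_until_full(readings, capacity=100.0))
```

Producing the number is the easy part; as the post says, getting anyone to act on it is where the effort goes.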
When the system's running slowly...
...the first question a Performance Detective always asks is:
"Is the support for this system outsourced?"
If the answer to this question is "yes", then that explains everything.
Instead of designing from the client down, design from the backend up. That way the only software any specific desktop machine runs is the software required for that specific desktop.
Me, I use BSD on the servers & Slackware as a base on the desktops. Try it, you might like it.
Or, you can continue struggling with a commercial based top-down approach ... I mean, seriously, who in the fuck REALLY thinks that basing a corporate network around an OS that most users don't even touch 0.1% of the lines of code of the software installed on their desktops is a good idea? HOW many gigs for a minimal system these days? I can do the SAME EXACT office work with DOS 3.3, Wordstar & Lotus!
I wonder how fast the US and State budgets would be balanced if Redmond & Cupertino were banned from all state & federal systems ... Yeah, there is a learning curve, but you only have to learn it once.
 Don't give me "but what about Power Point?" crap ... I fire Power Point aficionados on sight ... and have done similar since the days of Ashton Tate, and I'm not talking dBase ;-)
Not a waste of time
While partially agreeing with 'Just wasted 5 minutes....' (not the 'load of crap' part), I feel that the type of training given these days has resulted in a lot (most that I come across) of IT techies, managers and users not having much of a clue what is really going on.
The complexity of systems (all mentioned in previous messages) requires a good knowledge of the basic principles of IT, networking and the equipment used to facilitate it. For some time now I haven't seen many courses that offer really good 'training'; it's all quick and dirty, 'let's make as much out of training with as little input as possible'. Same problem in a lot of technical fields these days.
'Never have so many known so little about so much'.
Old and grumpy!
Can't manage what you don't measure...
I agree that the user experience is important and should be monitored. If possible, you should consider whether it's possible to "bake in" the capturing of suitable performance information from the outset.
One reconciliation system I worked with a few years ago had the inbuilt ability to record the details of what the user was doing (e.g. what options they had selected when performing a particular task) along with the time it took to perform and how much information it resulted in. This could then be recorded in a separate database for review/analysis.
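A "baked in" recorder of that kind can be quite small. The sketch below is not the reconciliation system described, just one possible shape for the idea in Python with SQLite: the table name `task_log`, its columns and the `recorded` helper are all assumptions made for the example.

```python
import json
import sqlite3
import time
from contextlib import contextmanager
from typing import Dict, Iterator


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the separate database that holds task timings."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_log ("
        "username TEXT, task TEXT, options TEXT, seconds REAL, rows_out INTEGER)"
    )
    return conn


@contextmanager
def recorded(
    conn: sqlite3.Connection, username: str, task: str, options: Dict[str, str]
) -> Iterator[Dict[str, int]]:
    """Time the enclosed block and log who ran what, with which options."""
    result = {"rows_out": 0}  # the caller fills this in with the output size
    start = time.perf_counter()
    try:
        yield result
    finally:
        elapsed = time.perf_counter() - start
        conn.execute(
            "INSERT INTO task_log VALUES (?, ?, ?, ?, ?)",
            (username, task, json.dumps(options, sort_keys=True),
             elapsed, result["rows_out"]),
        )
        conn.commit()


if __name__ == "__main__":
    log = open_log()
    with recorded(log, "fred", "monthly_report", {"range": "month"}) as out:
        out["rows_out"] = 42  # stand-in for the real work
    print(log.execute("SELECT * FROM task_log").fetchall())
```

Because the timing is written in a `finally` block, a task that fails part-way still leaves a row behind, which is often the row you most want to see.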
All too often the "lone user" performance issue is difficult to quantify: "it seems slower than yesterday", with nothing to actually back it up.
Having the details of what the user was doing "Hi Fred, I can see you ran the same report yesterday and it took only 2 seconds longer today..." can be a powerful tool. Once you have that raw data you can start doing more useful things with it (eg is the average time to run a given process creeping up and up over a period and if so, what are you going to do about it).
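One simple way to answer the "is the average creeping up?" question, once the raw timings exist, is to compare the latest few runs with everything before them. This Python sketch is an illustration only: the function name, the window of five runs and the sample timings are assumptions, and it expects positive durations.

```python
from statistics import fmean
from typing import Sequence


def creep_ratio(durations: Sequence[float], window: int = 5) -> float:
    """Ratio of the mean of the latest `window` runs to the mean of earlier runs.

    A value of 1.0 means no change; 1.2 means recent runs are 20% slower.
    """
    if len(durations) <= window:
        raise ValueError("need more history than the window size")
    earlier = fmean(durations[:-window])
    recent = fmean(durations[-window:])
    return recent / earlier


if __name__ == "__main__":
    # Hypothetical run times in seconds for one report, oldest first.
    history = [10.0] * 10 + [12.0] * 5
    print(f"recent runs are {creep_ratio(history):.2f}x the earlier average")
```

What ratio counts as worth a conversation with Fred is a local decision; the point is that the conversation can start from numbers rather than "it seems slower".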
The view that support folk would like to deal with a significant meltdown in their systems in my experience couldn't be further from the truth. Whilst a frontline L1 helpdesk call handler may not have the time/knowledge/incentive to proactively suggest improvements, if the systems are that complex, they'll be backed up by the relevant L2 application support groups who will have a vested interest in ensuring that improvements are introduced proactively.
At a basic level, it is human nature, if you can do something to reduce the "noise" generated by avoidable support requests then you'll do it. The beneficial side effect is that hopefully the user experience is also improved :-)