Email and OneDrive
I couldn't access OneDrive and Exchange for about 1-2 hours today. Everything's back to normal.
It seems Microsoft's Office 365 is having an unscheduled nap as users across the world report difficulties logging into the administration portal. Office 365 is having issues since 04:51 AM EST https://t.co/3GGuh8EZs8 RT if you're also affected #office365down pic.twitter.com/ixHgD7YHv1 — Outage Report (@ReportOutage) April 6, …
Because it tends to be cheaper and easier than running your own resilient email system, if you're only managing a handful of users. Sure, it goes off from time to time, but that passes and in a few days it's forgotten about.
It's like when you drink tea that's too hot and it burns a bit on the way down, but a few seconds later you're fine and keep drinking the tea once it's cooled a little. People are funny like that.
Substitute coffee / hot chocolate / hot beer / hot absinthe or whatever you like to drink hot in lieu of the tea...
We run our email in-house, over multiple servers, in multiple sites.
Don't use it as your ONLY resource; I think that should be obvious.
Such cloud options are great. As one option per cloud. But you should have in-house stuff too, or you're entirely reliant on a) your Internet connection and b) Microsoft. There's no reason you can't run a secondary email server, AD server, etc. in-house. If you do it right, I can't see a reason that external users would notice anything had even happened if the entire cloud went down.
But it's "lazy IT". Let's just pay a monthly subscription, then that's us sorted because "it's Microsoft". They don't give any thought to business continuity. Sure, it can work at some levels of business. But if you are a corporation of any size and you're affected by this... you obviously put too many eggs in the same basket.
"But it's "lazy IT". Let's just pay a monthly subscription, then that's us sorted because "it's Microsoft". "
That 2nd sentence is an MBA laden CxO talking, not rank and file "proper" IT. IT much of the time doesn't make its own budgets or decisions on stuff like this. A couple of dinners with some MS sales guys at the CxO levels, and this sort of thing gets pushed through over IT's objections. (been there, got the t-shirt). I would say it's probably more Lazy management that's trying to trim both the personnel and equipment budgets while simultaneously getting nice meals...
Many other people had already learned about this, long before the latest occurrence. (Based on what I've seen from you here, you're one of the people who saw this coming).
So perhaps the question that really needs answering might be "why does persistent IT incompetence on this scale (and with no improvement in sight) seem not to matter to the people who pay the IT budgets?"
NB "IT budgets" also needs to include anything where computers and software play an important part of product functionality, e.g. automation equipment, 'self driving cars', a whole pile of computer-dependent stuff.
"why does persistent IT incompetence on this scale (and with no improvement in sight) seem not to matter to the people who pay the IT budgets?""
Most high level management these days is basically a bunch of gamblers who are taking the gamble that the reductions in resilience (or in MBA speak.. "savings") that generate large bonuses for them will not come back to bite them in the arse before they retire or move on to the next schmuck of a company that believes their interview bull.
"...that generate large bonuses for them will not come back to bite them in the arse before they retire or move on to the next schmuck of a company that believes their interview bull."
I don't think it's that. I mean most bad IT decisions don't actually save any money, not even in the short run. In fact many even have short run negative consequences.
My hypothesis is that there are many IT departments which believe _anything_ a salesperson will tell them. That's why companies _still_ run antivirus software, even though its benefits have long been disproven by both theory and practice. That's why companies still invest in "office productivity" software like Microsoft Office or Open Office, even though those mostly cause your employees to waste time on things they don't know how to do, like making a printed document look good.
The problem is that the people making the purchasing decisions don't understand technology at all, so they don't question what sales people or random websites are telling them.
Most such purchasing decisions are not made by the IT department, but even the IT dept often don't have much of a clue either. The requirement for staff increased much more quickly than the availability of skilled staff, so companies have to take whatever they can get - including people who don't have much of a clue.
"why does persistent IT incompetence on this scale (and with no improvement in sight) seem not to matter to the people who pay the IT budgets?"
It is, perhaps, because people who know how computers work, generally don't work in computer administration any more, they move on to higher paying jobs like programmers.
There are a few exceptions, like that game company where the employee handbook leaked recently. They had a system where you had desks with wheels you could move around by yourself. As soon as you plugged it in at your new location the floor plans would automatically get updated.
Since most IT-departments are horribly bad at what they are doing, most people have never experienced a good IT-department, which means that they don't demand it to be actually useful. Good IT is too rarely seen as an enabler of success and an effective motivator of your staff.
It is, perhaps, because people who know how computers work, generally don't work in computer administration any more, they move on to higher paying jobs like programmers.
I have come across programmers who appear not to know how computers work.
Because the people that are paying the IT budgets like shiny things and savings. They are fooled by so called digital directors who are just hipsters in disguise with no clue about real IT work. These so called directors want to be "Infrastructure free" because it will be a "massive cost saving", clearly showing they haven't a clue about how the cloud or local IT actually works.
Problem again is, the people that pay the budgets listen to this fool.
'So perhaps the question that really needs answering might be "why does persistent IT incompetence on this scale (and with no improvement in sight) seem not to matter to the people who pay the IT budgets?"'
Because they collected their handsome bonus for "cost saving measures" and left the company before the consequences became obvious. They're the same people who were happy to be sold a bunch of outsourcing crap; the same ones who paid small fortunes for endless streams of suits wrapped around MBAs from three- and four-letter professional-grade-bullshit (PGB) outfits, to be told, during an elaborate Death By Powerpoint, how to synergistically leverage the business process enhancement matrix for maximal mission-driven shareholder return which, strangely enough, meant cutting costs by sacking the very few remaining greybeards who knew how things actually worked.
In short, "the people who pay the IT budgets" have long since learned to trouser the bonus and skip out before the scale of their monumental ignorance and disastrous incompetence becomes apparent. Many of them have CVs to leave you breathless with admiration, when you know that they have had 20-year careers fucking up every single thing they have touched.
But do not imagine that they are entirely worthless meat: to the saleslizards who work for the aforementioned PGB consultancies, such executives are as a herd of plump gazelle to the gaze of drooling hyenas ... fat, dumb and oh so tasty.
Unless you can actually check each item on a CV for veracity, it is wise to assume it to be at best wildly inaccurate and a heavily manicured version of the truth, and at worst complete fiction. By way of example, a former colleague of mine describes the time when we worked together (as low-end dev-ops on a Remedy ARS system) as the time when he was the team leader of that entire section.
Which, to be brutally honest, was complete and utter balderdash, wild exaggeration and outright fabrication of what he was actually doing.
Employing a person on the basis of information that they provide which cannot be independently checked is utter folly.
All brilliant Ideas in theory (not relying on the cloud, or any single source).
I suspect my situation is like many others.
Can you imagine what the answer would be (or in fact, was) when you ask for all the relevant kit, software and licences to have an on-prem exchange box (for example) *AND* the cloud offering, when the previous iteration of people in your job sold them on "the cloud" to save money?
I'll give you a hint. Two letters, started with "n", ended with "o".
The upshot was we've had approx 1,5 hours of slightly flaky emails, and skype refused new log-ons for a bit (but worked for those logged in). Now all fixed (10ish-12ish today).
In our business neither thing being down was a significant cost or risk, so correct decision made (as those who said "no" would see it, they're probably right).
Oh, I'm quite sure there are tens of thousands of places that do just that.
Cost-benefit analysis is fine, so long as someone did it.
It's when people whine that "Oh, my 4G credit card reader is down, I can't take cards, my business is in ruins" or "WE CAN'T GET OUR EMAILS!", when they haven't bothered to take such a loss into account that bothers me.
To be honest, everything in the world from GMail to Azure, AWS to IBM will be down for things on that order of magnitude, no matter what they promise. They have to be. But it's what YOU do about it for your circumstances that matters.
Emails will delay an hour or so and then come in later if it comes up. Skype shouldn't be a business-critical tool. Your remote workers not being able to get in on VPN for a bit is no worse than someone tripping over a plug. But when you whine that your work-at-home telesales can't dial into your VoIP VPN for a fraction of a second and it's costing you money, I have to just think "Okay, so what was your backup?"
@Lee D Absolutely.
In my Yoof I worked for, cough, *a* yorkshire based ISP.
With every single outage would come the angry phone calls and even speculative invoices from people who "can't operate my business" and are "losing thousands an hour here".....
All from their 9.99 a month (single) home ADSL account that happened to be down.
Same deal innit.
o365 isn't actually cloud. It's hosted. *IF* it was actually cloud, it would be more resilient and less susceptible to the issues that keep downing it. :)
Sidenote - In a meeting with MS o365 and Azure Sales and Tech, the o365 drone said "cloud", and I corrected him, with the Azure guys nodding with me. :) They weren't happy about the o365 guys throwing the word "cloud" about when they are a hosted service...
Hosted is someone else's computer - and o365 is totally hosted, with little in the way of resiliency.
Strictly speaking, Cloud is quite dynamic/automated and much more resilient and distributed. However... Marketing weenies will call everything on the internet "cloud" to keep a marketing hypewagon of buzzwords rolling. So at this point, the term cloud is virtually meaningless...
Cloud is a buzzword, simple as that. No real meaning, except what that shiny-suited salesman wants it to mean.
A cloud-hosted solution is a data centre full of servers on which the service is running. The only difference is that the cloud-hosting operator can probably do it a bit cheaper than you can, through economies of scale. The *other* difference is that you are not only relying on the cloud hosting machines to stay up, but also all the networking kit between you and them, plus other associated gubbins like DNS and the like.
When it all works, it is cheaper. When it doesn't, what ho, you got what you paid for.
*ALL* Microsoft hosted services were inaccessible for our business during this outage. Skype for Business, OneDrive, SharePoint Online, Outlook etc. Much more disruptive than just the publicised "Can't access the admin portal for Office 365" statement. Communication from Microsoft during this major EMEA wide outage has been very poor.
> as those who said "no" would see it, they're probably right
Yes. If you're big enough to afford a proper IT staff, and the necessary software licenses, and redundant hardware, then maybe you can do it better in-house. But for smaller outfits the cloud often makes more sense.
Sure there's going to be occasional downtime, no matter how many 9's your cloud provider claims, and of course you still need to keep backups somewhere other than in that same cloud, but unless you can afford multiple competent IT people (more than one, since they tend to take vacations sometimes) and redundant hardware (elsewise you have a single point of failure in your one crucial server, and how long will it take to re-build the server when it eventually fails) you can probably get better uptime in the long run if you use cloud solutions.
It's like mains power: a few sites really do need a large onsite generator and etc so they can ride out multi-hour power outages, but that's very expensive. So most people just have enough UPS so they can run for a few minutes and then safely shut things down if the power doesn't come back quickly. Long power outages are rare enough that for most businesses it doesn't make economic sense to maintain your own fully-capable power-generation equipment. So it is becoming with IT.
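To put a number on the "how many 9's" point above, here is a rough sketch that converts a count of nines into the downtime it allows per year. The function name is made up for illustration, and it assumes a plain 365-day year with no allowance for scheduled maintenance, which real SLAs often exclude.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignore leap years


def allowed_downtime_minutes(nines):
    """Minutes of downtime per year permitted by an availability of N nines.

    Three nines means 99.9% availability, five nines means 99.999%.
    """
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * 60 * unavailability


for nines in (2, 3, 4, 5):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

On these assumptions three nines leaves a bit under nine hours a year and five nines a little over five minutes, which is the bar an in-house setup would have to clear as well.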
What is this "vacation" of which you speak? You keep using that word, I do not think it means what you think it means.
For a small business Office 365 is more cost effective than running Exchange. If you get more than about 25-30 users it becomes less so over time. Problem is, convincing people to spend the dosh on internal servers in a CAS array for a small business is not going to happen most of the time.
The cloud is not always suitable for small businesses either. One has to consider data privacy laws such as the US HIPAA that have strict requirements about data access, etc. The cloud is also nothing more than moving an internal server farm to an external server farm with supposed cost savings (or more accurately more obvious costs) and possibly some operational efficiencies.
One major difference with external vs internal is how many are affected when the service goes down. Internal only affects one company while external affects numerous.
> Yes. If you're big enough to afford a proper IT staff, and the necessary software licenses, and redundant hardware, then maybe you can do it better in-house.
I worked for a client who are not a small outfit by any means. They went over to o365, and I think it was the right decision, even though at the time I did not think so. And that opinion is purely from having subsequently seen the inertia that appears to be endemic in their IT department. So, if Microsoft can deliver a better service, good for them - the in-house IT only shot themselves in the foot, IMHO
On the main tenant we manage we never experienced any issues. Maybe because we run a hybrid setup and adfs authentication, we’ll see when (or if) they post an RCA.
Another tenant gave an error that said something along the lines of “could not authenticate at the time”, but a third one (and the first one I checked when calls came in 2 to 3 hours ago) said “user unknown on this system”. I very nearly shat myself.
It seems the crisis has passed, and fortunately before beer o’clock. Another notch on the “bad microsoft” stick, but I still wouldn’t go back to running it all in house... the absolutely critical, losing-money-if-its-down stuff : yes. Email ? IM ? Sharepoint ? Hell no. I wouldn’t be able to get the uptime of most cloud providers without spending huge amounts of dosh.
We've been 365 for around 5 years now, this is the second, short, outage I remember.
One was a failure to register new accounts and it lasted around 3 hours, one was today and we lost email (for some not all) and skype wouldn't accept log ons, neither would the admin portal.
Not brilliant no, but you'd only need one major exchange snafu or hardware snafu and you'd breeze past that downtime amount anyway.
It's a crying shame it all came back up so quickly. There I was having a nice unexpected Friday morning catching up on El Reg news stories from across the week... and then at about 12:30 it all comes back online. Outrageous.
Totally inconsiderate to the needs of the *cough* hard working *cough* IT chap this Microsoft lot are.
I love the prelim root cause analysis and fix: "Engineers determined that instances of a backend service responsible for processing authentication requests became unhealthy preventing requests from completing.
Mitigation: Engineers performed a recovery of the impacted backend service."
So no info at all. Someone unplugged something and plugged it back in? Someone shut down a service and restarted it? Someone ran too many copies of Doom II on a 4mbps Token Ring IPX/SPX network then went to lunch once victorious?
"So no info at all"
Seems adequate to me for a PRELIM RCA. The details will follow - and what they will do to stop it happening again.
Here is an example of a completed one:
Title: Can't access OneDrive for Business
User Impact: Users may have been unable to access OneDrive for Business.
Final status: We’ve determined that a load balancing configuration problem caused a resource scaling issue, which prevented some user access connections. Our automated recovery system restarted the affected components, which remediated impact. Additionally, we're deploying a fix, which improves the sync client resiliency against a future service outage, which is expected to fully deploy within in the next five days. You can download the new build immediately using the following link: https://oneclient.sfx.ms/Win/Prod/18.025.0204.0009/OneDriveSetup.exe
Scope of impact: Impact was specific to a very small subset of users who were served through the affected infrastructure.
Start time: Tuesday, March 6, 2018, at 11:05 AM UTC
End time: Tuesday, March 6, 2018, at 1:40 PM UTC
Preliminary root cause: A load balancing configuration problem caused a resource scaling issue, which prevented some user access connections. This issue was exacerbated by a sync client resiliency issue.
- We're reviewing our monitoring services to look for ways to reduce detection time and more quickly restore service.
- Deploy the fix that improves sync client resiliency against future service outages.
Message to the gullible, the cloud is no more reliable or cheaper than your current on-premise stuff.
Prima facie evidence is MS pushing everyone to Azure where they can bleed them dry, monthly, til the end of time... whether it's working or not. Office 363 anyone?
Ah, I remember when IE4 didn't work with the shiny new IIS 5. But Netscape worked perfectly.
They'll never change, will they?
Sirius Cybernetics Corporation? Oh, they dream of making products as reliable as those from the Sirius Cybernetics Corporation*.
* before the Adams fans pile into me, yes I know. That's the joke.
Nan's just been on the phone in tears! The outage meant she couldn't access the recipe for pineapple upside-down cake kept on Onedrive and the Vicar had to be given shop-bought cake. Well done MSFT for embarrassing a little old lady.
Also 4,000 staff at my company have been idle.
Microsoft had problems with Outlook.com earlier in the week. It said it couldn't access my mail box starting Sunday night. Then it said that my account was terminated. Luckily, that wasn't the case, but, it was a pretty alarming message to receive. I'm thinking they must have deleted and recreated my account when they did a restore or something. Pretty scary though. I'm not even sure who I would contact if they somehow deleted my account. All my important email goes there. I would be screwed if I lost that account. Issues like this are precisely why I don't trust the cloud. I found a status page that can be checked at https://portal.office.com/servicestatus. I need to figure out the best way to start backing up my email.
All my important email goes there. I would be screwed if I lost that account.
Issues like this are precisely why I don't trust the cloud.
Sorry, these two statements do not marry up. If you don't trust cloud services why the hell would you trust your important email to a Hotmail account??
I'm not talking about work email. I'm talking about my personal email. OK, so, I could set up dynamic DNS and get my own domain and set up my own email server. That is a lot to do for just one person and not something that 99.999% of people do. And yes, the cloud and these corporations cannot be trusted. What do you do for email? Run your own server?
"What do you do for email? Run your own server?"
I almost didn't bother to sign on, but...
Get a client. Run POP and set it to leave the messages on the server or delete after x time. There are several free clients. Thunderbird seems to do a decent job, at least until they kill it. If you are really concerned, then back up the mail folders periodically. It's also a hell of a lot more user friendly.
What do you do for email? Run your own server?
Well, as it happens, yes, I do.
But for my Hotmail I use Thunderbird, instead of webmail, so I can download all my mail from that account to a local machine, and then I can back it up on a USB drive as well.
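For anyone who would rather script the "pull it down over POP and keep a local copy" approach than rely on a mail client, here is a minimal sketch using Python's standard poplib and mailbox modules. The function name, the host pop.example.com and the credentials are placeholders I made up, not anything from a real provider's documentation; check your provider's POP settings before trying it.

```python
import mailbox
import poplib  # only needed for the real connection shown in the comments below


def backup_pop_mailbox(pop_conn, mbox_path):
    """Copy every message from an open POP3 session into a local mbox file.

    Nothing is deleted from the server: POP3 only removes a message when
    the client issues DELE, and this function never does.
    Returns the number of messages saved.
    """
    message_count, _mailbox_size = pop_conn.stat()
    archive = mailbox.mbox(mbox_path)  # created if it does not exist yet
    try:
        for number in range(1, message_count + 1):
            # retr() returns (response, list of byte lines, size in octets)
            _response, lines, _octets = pop_conn.retr(number)
            archive.add(b"\n".join(lines))
        archive.flush()
    finally:
        archive.close()
    return message_count


# Hypothetical usage against a real server (host and login are placeholders):
#
#   conn = poplib.POP3_SSL("pop.example.com")
#   conn.user("me@example.com")
#   conn.pass_("app-password")
#   backup_pop_mailbox(conn, "mail-backup.mbox")
#   conn.quit()
```

The resulting mbox file is a plain file, so copying it to a USB drive is the whole backup step. Running the sketch twice against the same file would append duplicates, so write to a fresh file each time or add your own de-duplication.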
Here is some basic math that won't give you a perfectly accurate answer, but close enough.
If we assume that this lasted 3 hours, and that the system has been fine for this year, then we have (31+28+31+5)=95 days = 2280 hours of successful operation, not counting any of today. Therefore, we currently have (2280/2283)=99.869% uptime. We want it up to 99.999, meaning we need a total of (3/0.00001)=300000 hours of successful uptime. We already have 2280, so we need 297720 more hours, that's 34 years. That changes if the outage was not exactly 3 hours, but more so if we get to count last year's performance. I don't operate an office365 system, but another poster said this was the first outage in 28 months. We'll knock off 4 months for my calculation of this year. That still moves them up to a current performance of 99.985% and their additional no-downtime required down to 32 years.
So they don't really have a chance for five nines, but they definitely make three and four is almost guaranteed. I might not consider this for extremely critical things, especially if there's a risk for disconnection from the network on my end, but their infrastructure performance is not bad so far. Let's see how long it is until the next major outage. If it happens really soon or takes forever to fix, I'll consider dropping my semi-endorsement of the system. For now, I think they're fine.
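The arithmetic above is easy to re-run with different outage lengths. This is a rough sketch of the same calculation; the function names are mine, it assumes the only downtime is the single 3-hour outage, and it counts the extra 24 months as two 365-day years. It also subtracts the 3 hours of downtime from the required total, so it differs from the figures in the post by 3 hours, which vanishes in the rounding.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY


def availability(up_hours, down_hours):
    """Fraction of the total elapsed time that the service was up."""
    return up_hours / (up_hours + down_hours)


def clean_years_needed(up_hours, down_hours, target):
    """Years of further outage-free running needed to reach `target`.

    Assumes downtime stays fixed at `down_hours` from now on.
    """
    total_hours_needed = down_hours / (1.0 - target)
    extra_up_hours = total_hours_needed - down_hours - up_hours
    return max(0.0, extra_up_hours) / HOURS_PER_YEAR


this_year_up = (31 + 28 + 31 + 5) * HOURS_PER_DAY      # 95 days = 2280 hours
with_history_up = this_year_up + 2 * HOURS_PER_YEAR    # plus 24 earlier months

for label, up in (("this year only", this_year_up),
                  ("with 24 more months", with_history_up)):
    uptime = availability(up, 3)
    years = clean_years_needed(up, 3, 0.99999)
    print(f"{label}: {uptime:.3%} uptime, about {years:.0f} clean years to five nines")
```

With those inputs it reproduces the 99.869% and 34 years for this year alone, and 99.985% and 32 years once the earlier 24 months are counted.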
At my work, they just switched to Office 365 from running local Exchange servers. And guess what, local mail delivery is slower now. Who could have ever foreseen that? By all means people, lap up everything the corporations tell you and help fatten their wallets being a slave to them.
Yeah, funny how all the added complexity being employed that is supposed to make things more reliable ends up making it less reliable. I think a lot of people in IT are just bored, so, they come up with overly complicated systems to try to puff themselves up and pad their resumes. Cloud and micro-services is the current trendy thing to do. Never mind if it has any benefits. You have to do it or you're not cool. Sorry, but, I'm not buying into a total load of BS.
All you fans of OPEX that were instrumental in getting your on-prem workloads migrated to cloud because of evil, wicked CAPEX... time to revise your OPEX costs for this month. Be sure to include the hourly rate of every employee adversely affected by the outage, multiplied by the number of hours that the service was down.
Those of you with more than a couple hundred employees will be granted a ten minute grace period to go weep quietly in the corner.
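For anyone who wants to actually do that sum, a back-of-envelope sketch follows. The function name is made up, the 4,000 headcount is borrowed from an earlier comment, and the flat rate of 20 per hour and the 1.5 hours of downtime are purely illustrative assumptions. It also assumes every affected person was completely idle, which overstates the loss for anyone who could get on with other work.

```python
def outage_cost(hourly_rates, hours_down):
    """Cost of idle staff: each affected employee's rate times the hours down."""
    return sum(hourly_rates) * hours_down


# Illustrative numbers only: 4,000 affected staff at a flat 20 per hour,
# idle for 1.5 hours.
affected_rates = [20.0] * 4000
print(outage_cost(affected_rates, 1.5))  # 120000.0
```

Passing a list of rates rather than a headcount and an average keeps it honest when the people affected are not all paid the same.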
"include the hourly rate of every employee adversely affected by the outage, multiplied by the number of hours that the service was down."
One place I knew used to have an "IT Downtime" task code on the time sheets, to be used by staff whose work could not usefully be processed when servers were down for more than the usual hour or two at a time.
When the company-wide effects of the IT Crowd's efforts became obvious in the lost hours visible in spreadsheets at Manglement level, did they do something about the performance of the IT Crowd? No, they removed "IT Downtime" from the timesheets, and did nothing about the downtime.
In a similar way, when the "employee satisfaction survey" came back year after year with negative results and even specific complaints about the competence of the management, rather than address the issue, senior management stopped doing the survey.
Dilbert would have been delighted.
One reason why I don't like cloudy services. Your entire company's at the mercy of said cloudy services. And there's also other factors to consider - internet access etc etc etc.
Also keep in mind that the techs at the cloud centre/host are all "shared" amongst more than one company. Meaning if one company experiences major b0rkage, then all hands may attend to said b0rkage, leaving you to wait for somebody to attend to your small problem.
I work for an MSP and for a large client about 3 years ago I put in place a 4 node Exchange DAG with a pair of load balancers across 2 sites. Not a minute of downtime since it went live.
Hearing they were moving to O365 about 6 months ago saddened me after putting in place a system which I was extremely proud of. (Had a site failure and users didn't even know).
Whilst I'm sure this won't change their plans, hearing today from them that they had email issues for the users they've migrated did at least put a smile on my face briefly...!
I have a life, when you have the responsibility (with the world watching) and no one to ask, I know, scary, the numbers simply don't stack up below a certain user count, and/or roll your own or outsource. I know, I've cooked every broth of every flavour. I'd rather not have the responsibility and a global team of people doing their stuff. Multiples nines is seconds. #inthepub
Can we all call it Office 320 from now on ?
Maybe slightly unfair, they made incredible progress this year ... YTD last year they had accumulated 14 days of downtime, iirc ... this year it is much better ...
They finally hired someone who understands certificates and renewals ... if they could get Teams to work in Firefox I would appreciate it ...
I forgot, I do not use Teams by choice, that lousy piece of unreliable crap was forced upon me by corporate fallacy ...
I am waiting for our IT boss to get his head out of his arse and realize that the recent MS vs FBI court case, well, Trump's bill that annihilated it, means that US clouds are not compliant with the GDPR ... I will be sure to remind him before the deadline ...
Icon: I always loved Trump, your great leader, thick as bone!
I remember in the 90s just after I got my IT career started a huge glut of people coming out of college and into IT, then they all ran off and "earned" Microsoft qualifications in this and that. All of a sudden the IT world was awash with shysters who promised they could improve IT by shipping out tried and tested high quality systems in favour of Microsoft across the board.
The problem was not Microsoft; their software is not really any better or worse than anyone else's. The problem was that, like any saturated industry, the IT industry was full of fecking cowboys who'd never even seen the inside of a computer let alone administered one. They all got jobs in middle management positions without a single fecking clue how IT systems worked.
Fast forward to today and what you have is companies with an inherent distrust of IT depts. Due to the constant screw-ups by these shysters that gave real IT people a bad name, companies just hate IT depts. They see IT as a necessary evil to be tolerated. Now we come to the cloud age and companies that hate their IT depts can see a cheap alternative by buying cloud services. Simply pay BT to hook up a fast internet pipe, buy a cloud service for your email and storage, sorted! Now you scale back the IT dept to those who can be trusted to set up mobile devices and use a browser to admin the cloud services and your IT budget is slashed by 70%, your IT workforce is slashed by 70% and you're promoted for making the company leaner and more modern.
Now all those companies like Microsoft that created the glut of idiots with meaningless qualifications are selling cloud services. We did this to ourselves. As the tag line says at the top of this site, "Biting the hand that feeds".
Enjoy the last days as IT Rome burns. Do what most sensible people in IT are doing now, set yourself up another career path 'cos IT is dying. Get down the local college and get yourself trained as a plumber, gas fitter, photographer, artist, whatever you like. If you're in IT now and you don't have an alternative career path on standby, you're a fool.
I started getting reports of problems Friday around 3.30 PM (GMT+8) and spoke to our MSP. At this point they reported that a couple of their clients were seeing the same problem. Shortly thereafter confirmed it was a Microsoft issue.
We run Office 365 including SharePoint Online extensively as a way of minimising required infrastructure in house (I am the first in-house IT person they have had) and the likelihood of an in-house Exchange server is minimal. The main reason is that they have staff on site at mine sites who may have access to Internet but anything that requires heavy traffic is going to be a big no-no. Even running Citrix desktop sessions is too difficult for some of the site guys so they run a localised version of O365.
By 7pm the issue had been resolved, so well done to the technicians who managed to repair the backend that caused the problem.