* Posts by Nate Amsden

2437 publicly visible posts • joined 19 Jun 2007

AWS power failure in US-EAST-1 region killed some hardware and instances

Nate Amsden

Re: Ever heard of a UPS?

It would probably surprise many, but there are some data centers that don't use any UPSs. That's not to say they don't have power protection: they rely on a newer(??) technology known as flywheels that provides backup power without batteries. However, while flywheels look cool on paper and are nice in that they don't need batteries, their critical failing is that runtimes are generally very short (measured in seconds, maybe 30s tops). That's supposed to be plenty of time for generators to kick in and take the load, if everything goes smoothly.

But what if things don't go so smoothly and human intervention is required? That's my problem with the runtime of flywheels: they don't give people enough time to react to solve a problem on the spot, or even get to the location of the problem, before they run out of stored capacity. Maybe that problem is the automatic transfer switch failing to switch to generator, so the generator is running and ready to take the load but someone has to go force that switch to the other position to transfer load to it. That just scares the hell out of me, and I would not want any of my equipment hosted at a facility that used flywheels instead of UPSs.

I was at one facility in probably 2004, a very nice AT&T facility, I liked it, my first "real" datacenter experience. I was in the reception area when the grid power failed. All the lights and computers etc in the reception area went dark. On-site tech staff came rushing in to get to the data center floor (from their on-site offices they had to go through the reception area to get to the DC) and reassured me the power was fine on the data center floor (it was, no issues there), though they struggled to get to the floor since the security systems were down, I think, but they did get in after maybe 30 seconds. I don't know where they were rushing to exactly, maybe they had to go do something to the generators! I didn't ask, but they sure were in panic mode. Power came back a few minutes later.

I wouldn't even trust redundant flywheels. I want to see at least 5-10mins of runtime available for generators to kick in. Ideally 99% of the time you won't need more than 30 seconds. I'm just paranoid though.

At my first true system admin job in 2000 I built out a 10-rack on-site server room. I equipped it with tons of UPS capacity (no generators). Had two AC units too. I was so proud. I hooked up UPS monitoring and everything, big heavy battery expansion packs on our APC SmartUPS systems, enough for probably 60+ minutes of runtime. Then one day the power kicked off, on a Sunday morning I think it was, and I got the alert on my phone. Yay, the alerts work. Then reality set in about 30 seconds later. Systems are running fine on battery backup, great. But... THERE'S NO COOLING. Oh shit. I drove to the office (5 minutes away) to initiate orderly shutdowns of the systems (doing it remotely at the time was a bit more sketchy). In the end no issues, but my dream of long runtime on UPSs had a fatal flaw there.

Flywheels were all the rage maybe 15 years ago; for all I know the trend died off (hopefully it did) a long time ago and I just never heard, since it's not my specialty.

Nate Amsden

Re: 9.5 hrs of downtime

Also wanted to add, what's worse than not having backups? I wasn't sure until I learned first hand. What's worse than not having backups is NOT TELLING ANYONE YOU HAVE NO BACKUPS.

When the most recent storage array went down several years ago, everyone outside of the small IT team believed everything was backed up. It wasn't until the array did not come back up that IT management raised the issue: oh hey, we don't have ANY backups of the accounting system going back TEN YEARS. I wouldn't have been upset if this was well known. But it was not. I wasn't involved in IT at the time, and well, that IT director is long gone (not for any reasons related to the incident at hand). The IT team was doing backups, just none of that system, because of technical limitations that again weren't communicated widely enough. We resolved those technical limitations later.

The array in question was decommissioned in 2019 (I think ~3 years past EOL date).

Nate Amsden

Re: 9.5 hrs of downtime

Yes, single data center. No DR plan. I've never worked at a company in my career (24 years) that had a workable DR plan, even after near disasters. Everyone loves to talk DR until costs come up, then generally they don't care anymore. At my current company this happened every year for at least 5-6 years, then people stopped asking.

The closest any came to having a DR plan actually invested in a solution on paper, due to contract requirements from the customers. However they KNEW FROM DAY ONE THAT PLAN WOULD NEVER WORK. The plan called for paying a service provider (this was back in 2005) to literally drive big rig trucks filled with servers to our "DR site" and connect them to the network in the event of a DR event. They knew it wouldn't work because the operator of the "DR site" said there's no fuckin way in hell we're letting you pull trucks up to our facility and hook them up (and they knew this before they signed on with the DR plan). They paid the service provider I think $10k/mo as a holding fee for the service.

That same company later deployed multiple active-active data centers (they had to be within ~15 miles or something to stay within latency limits) with fancy clustering and stuff for DR protection, years after I left. One of my team mates reached out to me joking that they were in the midst of a ~10 hour outage on their new high availability system (both sides were down; not sure what the issue was, I assume it was software related, like Oracle DB clustering gone bad or something).

At another company I was working on a DR plan; it was not budgeted for correctly and I spent months working on it. While this was happening we had a critical storage failure that took the backend of production out for several days. There was no backup array, just the primary. It was an interesting experience and I pulled so many monkeys out of my ass to get the system working again (the vendor repaired the system quickly but there was data corruption). Most customers never saw the impact as they only touched the front end. I got the budget I was fighting for in the end, only to have it taken away weeks later for another pet project of the VP's that was also massively underfunded. I left soon after.

Current company had another storage failure on an end of life storage system, and guess what the IT team had NO BACKUPS. Accounting data going back a decade was at risk. Storage array would not come up. I pulled an even bigger monkey out of my ass getting that system operational again(took 3 days). You'd think they would invest in a DR or at least a backup array? I think so. But they didn't agree. No budget granted.

Rewind to 2007ish, hosting at the only data center I've ever visited that suffered a complete power outage (Fisher Plaza in Seattle). I was new to the company and new to that facility. It had previously experienced a power outage or two for various reasons; one was a customer hitting the EPO button just to see what it would do (the aftermath was that all new customers required EPO training). Anyway, I didn't like that facility and wanted to move to another one but was having trouble getting approvals. Then they had another power outage and I got approvals to move fast. I remember the VP of engineering telling me he wanted out, he didn't care what the cost was, and I was literally at the end of the proposal process and had quotes ready to go. We moved out within a month or two. That same facility suffered a ~40 hour outage a couple of years later due to a fire in the power room. The building ran on generator trucks for months while they repaired it. It was news at the time; even "Bing Travel" was down for the duration, they had no backup site. Several payment processors were down too, at least for a while.

I read a story years ago about a fire in a power room at a Terremark facility. Zero impact to customers.

Properly designed/managed datacenters almost never go down. There are poorly managed and poorly designed facilities, though. I host some of my own personal equipment in one such facility that has had several full power outages over the past few years (taking the websites and phone systems of the operator out at the same time); as far as I know there is no redundant power in the facility, which was designed in the 90s perhaps. Though it is cheap and generally they do a good job. I wouldn't host my company's equipment there (unless it was something like edge computing with redundant sites) but for personal stuff it's good enough. Sad, though, that there are fewer power outages at my home than at that data center.

Amazon and other hyperscalers generally build their datacenters so they CAN GO DOWN. This is mostly a cost exercise, doubling or tripling up on redundancy is expensive. Many customers don't understand or realize this. Some do and distribute their apps/data accordingly.

As someone who has been doing this stuff for 20+ years I believe people put too much emphasis on DR. It's an easy word to toss around. It's not an easy or cheap process. DR for a "data center" makes more sense if you are operating your own small-scale "server room/datacenter" on site, for example. But if your equipment is in a proper data center (my current company's gear is in a facility with ~500k sq feet of raised floor) with N+1 power/cooling, ideally operated by someone who has experience (the major players all seem to be pretty good), the likelihood of the FACILITY failing for an extended period of time is tiny.

To me, a DR plan is for a true disaster. That means your systems are down and most likely never coming back. Equipment destroyed, someone hacks in and deletes all your data. Power outages or other temporary events do not constitute a need to have or activate a DR plan. But it really depends on the org, what they are trying to protect and how they want it protected. 99%+ of "disasters" can be avoided with proper N+1 on everything. You don't need remote sites, the complexity involved with failing over, or an app designed to be multi data center/region from the start; the costs of doing that are generally quite huge, for situations that almost never happen.

I've been involved with 3 different primary storage array failures over the past 19 years (all were multi-day outages in the end), and having an on-site backup storage array with copies of the data replicated or otherwise copied over would address the vast majority of risk when it comes to disasters. But few invest even to that level. I've only worked at one company that did, and they didn't do it until years after they had a multi-day outage on their primary storage array. I remember that incident pretty well; I sent out the emergency page to everyone in the group on a Sunday afternoon. The Oracle DBA said he almost got into a car accident reading it. Seeing "I/O error" when running "df" on the primary Oracle servers was pretty scary. That company did actually have backups, but due to budgeting they were forced to invalidate their backups nightly as they used them for reporting. So you couldn't copy the Oracle data files back, at least not easily; I don't recall what process they used to recover other than just deleting the corrupted data as they came across it (and we got ORA crash errors for at least 1-2 years after, though they only impacted the given query, not the whole instance).

Nate Amsden

Re: Not surprised.

Funny story from the "earlier" days of cloud at my previous company (before I started even): there were regular occurrences of lack of capacity at amazon's cloud (this was back in 2009-2010ish). My manager told me there was more than one occasion where they literally had amazon support on the line as new servers were being installed, and support instructed my manager precisely when to hit the "provision" button in order to secure the resources from the new systems before others got them. I don't know if this was a routine activity or the result of the head of amazon cloud being the brother of the then-company's CEO.

I met with the head of EC2(now CEO of amazon) while at that company(late 2010 I think) along with their "chief scientist" I think they called him, don't remember his name. They basically just spent the meeting apologizing to us for all the issues we were having and promised to fix them (manager said that was their regular response and very little ever got fixed). It was a fairly useless meeting. But hey now I can tell people that story too.

Then there were also regular reports of "bad" instances out there; people would deploy and re-deploy until they got the kind of hardware they wanted.

Nate Amsden

Re: Ever heard of a UPS?

UPSs are not 100% reliable, that is true, which is why for best availability you design for N+1 at the datacenter level. Amazon would rather build redundant data centers than redundant UPSs/cooling systems. Which at some scale makes sense, but again app designers need to build and test for that, and most either do not do it at all or don't do a good enough job (as evidenced by the long list of companies impacted by such outages when they happen). The OVH situation was even worse, a very poor/cheap design. Which again can make sense at super scale; shipping containers were (maybe still are) popular for server deployments at one point, but that's really a specialized situation not fit for general purpose computing.

I remember several years ago at the data center we use, they sent out an email saying they had lost power redundancy, maybe one of the grid links failed, I don't remember. My then director was freaking out, wanting to send notification of this situation to others in the company. I don't recall if he did or not, but I just told him to calm the hell down. You have a right to freak out if one of our actual power feeds drops (everything was on redundant power). But that never happened. They repaired whatever the issue was and things were fine, we never lost power at any PDU.

We had another data center in europe which was more poorly managed. They did have a time where they actually had to shut down half of their power feeds to do something, then a week later or so shut the other half down. That was annoying. I didn't have to do anything, though our fibrechannel switches were single PSU, so one of them went down for each part of the outage, but it didn't impact anything. I was so glad to get out of that Telecity data center, I really hated the staff there. They didn't like me either. Telecity was much later acquired by Equinix. We had soo many issues with the staff and their policies. In the end it was me who caused them to change their policy regarding network maintenance. Until I complained loudly enough they felt free to take network outages whenever they wanted without notifying customers. I just didn't believe anyone would do that, but they were doing it, at one point taking half their network down without telling anyone. They fixed that policy anyway, but had tons of other issues. We moved out in 2018 I think.

Nate Amsden

Re: 9.5 hrs of downtime

It's been 10 years and that hasn't happened yet. Well, systems fail and VMs are automatically moved over (often before the alerts even have a chance to get out). In fact the org has cut BACK on protection: we used to have products that provided automatic DB failover (ScaleArc) but they decided to stop spending on those, I suppose in part because we haven't had a primary DB fail in maybe 6 years now? I'm not sure. I remember one time a developer built something into their app and I asked why they did it that way; they said in case the VM fails and we have to rebuild, or something like that. I laughed and said that doesn't happen, not since we moved to our own little data center. It was a regular occurrence in cloud though.

The apps certainly have failed from time to time and there have been outages, especially when the main app(s) enter feedback failure loops because of a lack of really good testing. Those situations are pretty rare though.

But it is true that the org has come to expect super high uptime because that is what I've provided, so when something does go wrong, which is super rare sometimes they do complain.

There was one time in late 2012 when we were just slammed with traffic (within 1 year of moving out of cloud) and we didn't have enough capacity on our app servers. One of the lead developers said, fuck it, take QA down, give prod the capacity (that guy was really cool, I miss him). I said, wow, ok, I can do that in a few minutes if you really want. He said do it, and the CTO said go for it too. So I did. Powered off all of QA within 5 mins and gave the capacity to prod. Ran like that for a few weeks, they didn't care. Only had to do that once, we bought more capacity after that. There were other times where the app ran out of capacity but that was an app limit, there was no way adding infrastructure would have helped it (they eventually discovered and fixed those bottlenecks, or at least most of them; then the org scrapped that app stack (and the devs that built it left) and built a new one (with a new team) with even more issues).

The apps for years had many single points of failure as well, even when in cloud, such as a single memcache server that was super critical. Later the new app used memcache too; even though we asked them not to, they went ahead anyway. Remember when I said there were more issues on the new stack? They later told us that the new stack hosted stateful data in memcache and as a result asked us NOT to reboot those VMs for things like security updates. They could recover the data but it would be an app outage during the process. They eventually moved to redis in an HA configuration but it took years. I had our memcache servers running in vmware fault tolerance, though a host failure never took one down so FT never had to kick in.

Big failures are rare on prem or in cloud. What on prem helps most with, in my opinion, is small failures, which happen in cloud on a far too regular basis. Execs don't care much about those because they aren't dealing with them; whoever is managing the servers/storage is (which in my org would be my team), and that just means tons of headaches.

Two of my switches are still around and in production from their original deployment. I checked again yesterday: they were first powered on Dec 20 2011; currently one has 3,655 days of service and the other has 3,654. Everything else from that era was retired in 2019 or before.

I can recall just 3 VM host failures over the past ~18 months. All 3 were the same host. I believe it is a hardware issue (DL380 Gen9) but there's not enough evidence to determine what component(s) to replace. The system behaves as if both of its power feeds are cut at the same time, which is not happening (unless it's inside the chassis). The system ran fine for 3+ years before this behavior started. In the meantime I just use VMware DRS to prevent critical systems from running on the host until I have more data, so it's basically a non-event when it fails.

Nate Amsden

9.5 hrs of downtime

For Chef's software stack (https://status.chef.io/), my alerts dashboard hadn't been that red in years. Fortunately I didn't have to make any chef changes today. Obviously they didn't have a disaster recovery plan(nor do most companies), they waited for amazon to fix their stuff then tried to recover what they could(at least that is what it seems like as an outsider anyway).

The list of companies affected by these outages (amazon included) just shows that building apps that are resilient to such cloud failures is beyond the reach of most organizations (whether it is complexity or cost or both). I've only been saying that for just over eleven years now. Not surprised the trend continues. I moved my org out of amazon cloud in 2012 and have been running trouble free ever since, with literally $10-15M+ in savings since (it would have been nice to get more of that savings invested into more infrastructure but the company was stingy on everything). There was no lift and shift into the cloud, the company was "born" in the cloud (before I started even). But still many people just don't get it (that cloud is almost always massively more expensive than hosting yourself, unless you are doing a really bad job of hosting it yourself, which is certainly possible, though it's much more common to host it in cloud very poorly than to host it yourself poorly). I don't get how you couldn't get it at this point.

Db2, where are you? Big Blue is oddly reluctant to discuss recent enhancements to its flagship database

Nate Amsden

Re: Who in their right minds would invest in new Db2's?

Company I'm with used to use Percona MySQL support, maybe until 2014/2015 or so. They were pretty good. Used their percona MySQL distribution as well. Didn't have to call support often at all. Then one year they increased their prices probably by 700%+ and we dropped them immediately(didn't replace them with anyone else externally anyway). I could see adjusting our budget for a 200% increase in cost but the company wasn't willing to accept that much larger increase especially since we didn't engage them often.

Nate Amsden

Re: Who in their right minds would invest in new Db2's?

What I found sort of interesting in the article is that it started out touting Db2 in clouds and stuff, then went on to focus more on Db2 growth on mainframes. It gave me the feeling that perhaps the biggest reason one would choose Db2 at this point is if they planned or thought they would need to use mainframes. The only other reason is if you already have a history of Db2 in house (most likely on IBM hardware, perhaps AIX or something non-mainframe); then it makes sense to work with that knowledge.

Otherwise, as others have said IBM has a big hill to climb(I'm guessing they won't even try at this point) to convince folks why they should use Db2 over something else(mainly Oracle I suppose as MSSQL is generally Windows-only(I am aware of the Linux port of MSSQL)). I've only worked at smaller orgs but have only been personally exposed to Oracle, MSSQL, MySQL, and a tiny bit of postgres (that is the one I am least familiar with, and thus least comfortable using) over the past 20 years. I have been told that IBM has a much stronger presence on the east coast of the US than west where I am at, so that could be a big factor as well, I am not sure.

I suspect it's more likely that IBM goes out and acquires someone like EnterpriseDB (assuming they aren't owned by some super big company already, I don't know), similar to how Oracle ended up with MySQL. Then tout that combination with Red Hat, leave Db2 for legacy, and push Postgres going forward.

Nate Amsden

Re: Throwaway society mentality comes to IT

Was thinking about this more during my workout. Wanted to add that another great way to kill Oracle performance is not using bind variables. In fact the latch contention at the 2nd company was the result of a combination of their new ruby on rails app not supporting bind variables and them using a hackish workaround in the DB driver to "force" bind variables on everything. At one point they introduced new queries which, from what I recall, caused the query plan to change, and things just exploded at that point due to improperly using bind variables (I didn't figure out the cause, it was explained to me by the devs who wrote the app after they figured it out). The original app stack they had was based on java and did properly use bind variables; then they bolted a ruby app on and wanted to use the same DB backend (different schema but same instance).

The cause of latch contention at the first company, if I remember right, was massive use of SELECT FOR UPDATE queries, triggering major contention in some of the indices. I believe I was told that storage performance was not a factor as the rows in question were in buffer cache already, so getting faster storage wouldn't help; getting faster CPUs wouldn't help either, well maybe it would get you a few more seconds or minutes before the issue showed up, but realistically there wasn't a CPU on the planet that could have solved that problem.
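For anyone who hasn't run into this, the bind variable difference boils down to something like the sketch below (a generic JDBC illustration of my own, not the actual apps in question; the orders table and its columns are made up for the example). Concatenating literals makes every query a unique SQL text the database has to hard parse, while a bind variable lets one shared cursor be reused.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BindVariableSketch {

    // Without bind variables: every distinct orderId produces a new SQL text,
    // so the database hard-parses each one and the shared pool fills up with
    // single-use cursors -- a classic recipe for latch/library cache contention.
    static ResultSet withoutBinds(Connection conn, long orderId) throws Exception {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery("SELECT status FROM orders WHERE id = " + orderId);
    }

    // With a bind variable: one SQL text, one shared cursor, soft parses only
    // (and no SQL injection risk from the parameter, as a bonus).
    static ResultSet withBinds(Connection conn, long orderId) throws Exception {
        PreparedStatement ps = conn.prepareStatement("SELECT status FROM orders WHERE id = ?");
        ps.setLong(1, orderId);
        return ps.executeQuery();
    }
}

The "force bind variables in the driver" hack they used was essentially an attempt to turn the first pattern into the second after the fact, which is why it fell over when new queries changed the plans.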

Nate Amsden

Re: Throwaway society mentality comes to IT

another common saying among DBAs I have worked with goes along the lines of - you can double your performance by doubling your CPU speed(emphasis on speed not number of cores)/memory, but you can increase performance by 10X+ by fixing your query(ies).

I believe Oracle EE with an add-on does provide parallel execution (for a single query); most other DBs do not. Otherwise of course a single query is limited to a single CPU core, and CPU core performance doesn't increase very quickly. You can add more cores but that won't help individual queries, just the ability in some cases to run more queries. Our big Oracle servers back in 2004-2005ish literally spent more than 60% of their time waiting on I/O on average. CPU user time was quite small. We literally had vendors like Hitachi come in and tell us nobody else in the world was doing what we were doing with our DBs. They would spec out hardware and say in theory it should be fine, but we haven't tested things this way (in the end it was fine, as in it didn't break, at least until they tried NFS on NetApp and the NetApps shit themselves immediately, until they changed to fibrechannel). I wasn't responsible for back end storage or Oracle systems, though I did write a ton of stuff to monitor them which everyone relied on even years after I left.

I remember MySQL query cache contention being an issue too. One of our DBAs at my current company almost got fired when we moved out of cloud into our data center and he was adamant he didn't want the MySQL query cache on (most MySQL folks hate it, for good reason I suppose). The app required it (again, bad app). Performance was in the shitter for hours. He said no. Everyone else said yes. He relented and turned it on; app performance skyrocketed and things were ok again. We did have query cache contention at times but it was super rare for it to be a critical issue. Then they built a new app stack that didn't rely on the query cache as a performance boost, and we have had the query cache disabled for many years now.

Nate Amsden

Re: Throwaway society mentality comes to IT

"Badly written SQL? Why bother tuning it, just crank up the number of CPUs on the config page for the instance"

If you say a statement like that it makes me confident you were never a DBA, let alone an ORACLE DBA, for any period of time.

Ever heard of latch contention? Shit I remember dealing with that back in 2004, and again at another company in 2007ish. I remember one conference call in particular during an outage with latch contention. Our VP of engineering asked us, and I'll never forget because it was just such a strange request. He said

"Guys, is there anything I can buy that would make this problem go away?"

I wasn't the DBA (I have never been a DBA though I have managed database servers) or even responsible for back end systems (storage etc), I managed the app layers. The answer was a resounding no. You could buy the fastest computer in the world and it wouldn't solve this issue. You have to fix the app. We had the largest single OLTP database in the world at that company, at over 50TB in size (by 2006); Oracle said the next largest was Amazon at ~8TB. The 50TB size was due to poor app design (storing raw XML in the database). Probably one of the only companies in the world where, when they deployed data warehousing, the data warehouse was a fraction of the size of OLTP.

I had an argument with my manager at the 2nd company where we had latch contention. I knew the problem from past experience. I didn't know the CAUSE. Everyone blamed Oracle(except me and the outsourced DBAs). The problem was most quickly "fixed" by doing an Oracle restart. But then within a week it would come back. CPU would go to 100%, query rate would fall to the floor. Oracle Enterprise manager clearly showed latch contention happening.

After about 6 weeks of this the devs finally figured out the cause - a problem in the app. Which they fixed and the latch contention stopped.

I haven't used Oracle DB seriously since 2008; I don't count running the vCenter DB on Oracle for several years (it was either that or MSSQL and I wanted a DB running on Linux), as that was a tiny DB with a very low workload. Given the low workload and user count, the per-user licensing for an Oracle SE subscription was super cheap.

I know all of that and I'll never admit to being a DBA. I can generally manage Oracle and MySQL services/backups etc, but getting into the guts of tuning and queries etc is not me.

Another Debian dust-up with Firefox dependencies – but there is an annoying and awkward workaround

Nate Amsden

Re: And that is why…

Not sure why the down votes for you. As someone who switched exclusively to Debian in 1998 (from Slackware), I switched to Ubuntu for laptop/desktop use cases in probably the 2006ish time frame, mostly for the drivers; I don't recall the first Ubuntu version I used. About 18 months after Ubuntu 10.04 LTS went EOL I switched to Linux Mint (MATE) to keep my Gnome 2 UI, which I still use today.

I continued to use Debian on my personal servers until Devuan came out, then dist-upgraded all my systems to Devuan (never having had the pleasure of using systemd in Debian, as the version I upgraded from didn't have it).

I manually maintain my browsers in /usr/local/browser-version where I have firefox esr, seamonkey, and for a while I was running Palemoon too, until they broke all my extensions earlier this year and I switched back to firefox. I run the browsers under different user accounts for perhaps a tad more protection. I run a dedicated copy of firefox esr under a dedicated account for work webmail and atlassian products. Then I have another firefox esr in a VM connected to VPN which is mostly used for internal company services/sites.

The rocky road to better Linux software installation: Containers, containers, containers

Nate Amsden

Re: Confusing mess

Linux user here since 1996. Your story reminded me of when I installed SuSE on a computer I let my sister use when she lived with me, back in 2002 maybe? SuSE was one of the early distros to have a real slick installer and interface. I used Debian on my personal systems at the time (today I use Devuan on my personal servers and Linux Mint MATE on my main laptop).

Anyway, SuSE, linux, right? So imagine my surprise when I came home one day to see her running Yahoo IM on SuSE. Yahoo from WINDOWS. I couldn't believe it. I mean she just downloaded the windows installer (she was never much into computers, even now I'd rate her computer skills as entry level at best), clicked on it, did the install, and it worked, even put an icon on the desktop I think.

Obviously I know all about WINE, having used it off and on for ~10-15 years, but I was not expecting any distro to have it integrated to that level (the exception may be Corel Linux perhaps), especially back then.

Even today I find it hard to believe. She had no idea she was installing software designed for an entirely different operating system. And the damn thing installed and ran. I was impressed, shit still am.

Back on topic, I've never used snaps or flatpaks or whatever; I immediately uninstall all of that after a system gets installed if it is there to begin with. I can imagine it has some value for some folks out there. The only containers I use are ones I've built myself with LXC (so that means no docker either). Some people get upset they can't run the latest greatest version of some software package, because the distro packaged version X and they want version Y because it's better (it may not even make a difference for the user, they just see a higher version number and want it). 95% of the time for me I just use version X even if it is older, since it is good enough and the trade-offs to get to version Y aren't worth it. Snaps etc are for those folks I think.

I gave up trying to pitch Linux as a desktop replacement probably about the time OS X started hitting it big (2007ish?). I can recall only 1 person that I've worked with in the past 15 years who uses linux on their desktop as their daily driver (my work is generally with companies doing in-house internet facing website stuff, so developers, server/network admins etc etc). Some run windows (mainly the windows IT staff), but it seems everyone else is comfortable with Mac. I don't think I could ever use Mac, I tried it a couple of times and it's just not for me.

VMware pulls vSphere update that only made things worse

Nate Amsden

slow and steady

I've been using vSphere since ESX 3.5, and vmware GSX since 2004, and vmware workstation since it became a thing, and "vmware" (the original name) going back to 1999 when I used it on linux (I don't think there was a windows version of desktop vmware at that point yet?? not sure). I wish I hadn't misplaced my "Vmware 1.0.2 for Linux" CD, which I had up until probably 2010; that was a nice piece of nostalgia.

Anyway, at my org I stuck with ESX (the "thick" version, which I always preferred) 4.1 until at least a year past EOL before updating to 5.5 (I didn't even know at the time that 4.1 was EOL; HP vmware support told me, when I had a ticket open, that they didn't officially support 4.1 anymore but would try to help anyway). Then I stuck with 5.5 again until about a year after it went EOL (because I wanted to keep the .NET vSphere client more than anything; I say that as a Linux user since 1996, the newer vmware UIs are a significant downgrade in almost every respect, and I didn't think it could get worse when I first started using the .NET client, I was wrong) before updating to 6.5 (where my systems stand currently, still under support until I think Oct next year). Don't plan to update to 7 until after that time, not sure when. vCenter will probably update in advance (as I have done in the past); I currently use vCenter 6.7. But not in a rush.

Partially as a result of that (and in another part from keeping the configuration simple) I've had very few issues with vSphere over the past decade. I mean with at one point over 1,200 VMs in our org (now around 750) I tended to file on average less than 1 support request a year, usually for fairly trivial stuff. Once was kind of serious, as our windows-based vCenter 5.5 corrupted itself and kept crashing (blue screen). I had never recovered a failed vCenter server before so I engaged with support for a couple of weeks on that (in the meantime we had no working vCenter for production). The vCenter DB ran in Oracle on Linux; the fix at the end of the day was to build a new windows vCenter system (OS + apps on top) and attach it to the existing Oracle DB. That worked fine. I was just scared of potential side effects from doing that process (there were none that I recall), never having done it before.

Fortunately I haven't upgraded early to 7 this time around either, seems like it's been a real shit show. My only guess is the shitshow is a result of vmware trying to pivot last minute on vSphere 7 to embrace kubernetes stuff which sapped resources from the general product. I don't personally know or talk with anyone inside vmware so that's just my guess.

I am a super satisfied vmware customer, have been for over 21 years now, though the last product of theirs that truly got me excited was ESX 4.1. I have stuck to only hypervisor+vCenter+workstation. I haven't used(or been interested in) other products like vSAN, NSX, operations manager, bla bla..more complexity and more problems, and much higher costs.

I can't recommend highly enough using Logic Monitor to monitor VMware infrastructure though. Just amazing what insight I can get with that tool, and it's so easy to use. Before I deployed it in 2014 we had a lot of huge gaps in monitoring for our vmware stacks. LM is not a vmware-specific tool either; it can monitor many things (almost anything with an API, SNMP, JMX, etc). What sold me on it initially was the data I could get out of vCenter though. I mention this tool since I've seen a lot of posts over the years from people struggling to monitor things with the vmware-specific tools like ops manager and such.

Point is, no need to rush to upgrade to a new major version most of the time.

Survey shows XP lingers on while Windows 11 makes a 0.21% ripple in the enterprise

Nate Amsden

Re: Adieu but not goodbye XP

Curious which Netbook? I have an ASUS Eee PC 1000HE 10.2" Netbook that I just dug up the purchase receipt on; I bought it in May 2009, so 12 years ago. It has been running Windows 7 Home for several years, originally came with XP.

I didn't upgrade it (windows), but installed clean from a Win7 Home CD I bought through a friend who worked at MS at the time. I don't use the netbook often (a few hours/year), but it is super handy in some situations with its small form factor. Memory maxed out at 2GB I think, and it has a Samsung SSD in it now. Battery still seems ok too. My only real complaint about this eee pc I suppose is the weak CPU; being able to get more than 2GB of ram would be nice too, but the cpu holds it back more than anything. I have a bunch of old games on it too from GOG that work fine (games that were built for 486s originally).

I also had the original Asus eee PC with linux and a 4GB flash storage or something? That thing was terrible by contrast. Whether it was the screen, or slow storage, or lack of storage, lower memory(I think it was less than 1GB). I gave my original eee pc to someone else 10 years ago.

When the world ends, all that will be left are cockroaches and new Rowhammer attacks: RAM defenses broken again

Nate Amsden

Re: @msobkow

To me, ECC alone hasn't been enough for real servers for a long time. I remember reading this more than a decade ago regarding HP's "Advanced ECC"

http://service1.pcconnection.com/PDF/AdvMemoryProtection.pdf

The document is so old they reference generation 2 servers, which I was deploying back in 2004 (2005 at the latest) maybe?

from the pdf

"To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in 1996. HP and most other server manufacturers continue to use this solution in industry-standard products. Advanced ECC can correct a multi-bit error that occurs within one DRAM chip; thus, it can correct a complete DRAM chip failure. In Advanced ECC with 4-bit (x4) memory devices, each chip contributes four bits of data to the data word. The four bits from each chip are distributed across four ECC devices (one bit per ECC device), so that an error in one chip could produce up to four separate single-bit errors."

I've always wondered how well Advanced ECC does against these attacks. I have read ECC alone is enough to defeat them as they stand today, but have not noticed if Advanced ECC has any further benefit beyond regular ECC in this security scenario.

IBM has/had a similar technology called ChipKill:

https://en.wikipedia.org/wiki/Chipkill

(update)

Came across a PDF linked in above article from HP:

http://ftp.ext.hp.com//pub/c-products/servers/options/Memory-Config-Recommendations-for-Intel-Xeon-5500-Series-Servers-Rev1.pdf

Which puts things into plainer english

"Note that Advanced ECC is equivalent to 4-bit ChipKill. Lockstep gets us to 8-bit ChipKill. ChipKill just indicates that an entire DRAM chip can die and the server will keep running.

Negatives of Lock Step Mode:

- You have to leave one of the three memory channels on each processor un-populated, so you cut your available number of DIMM slots by 1/3.

- Performance is measurably slower than normal Advanced ECC mode.

- You can only isolate uncorrectable memory errors to a pair of DIMMs (instead of down to a single DIMM)."

I do remember turning on "Advanced ECC" in a Dell server (I was happy to see the option appear in the bios at the time, this was back in 2010 I think), however I was sad to see that it disabled a bunch of the dimm slots, I assume for fault tolerance. HP has a similar option called something like "Online spare memory" where some banks are kept in reserve (on my 384GB systems it lowered addressable memory to 320GB). I don't know any details of Dell's implementation, whether it was just online spare memory and they called it Advanced ECC or if it was some other approach. And perhaps they have improved it a bunch in the past decade. (update) I am guessing Dell's "Advanced ECC" was Intel Lockstep.

I have been quite surprised that others haven't come up with similar technology (thinking Supermicro and other smaller players). Or perhaps they have and I'm just not aware of it.

Remember SoftRAM 95? Compression app claimed to double memory in Windows but actually did nothing at all

Nate Amsden

Re: "Windows' registry doesn't need cleaning"

By the time I jumped ship to Linux on the desktop in about 1997-1998 (at home anyway) I was reinstalling windows about twice a year (perhaps more). First Win95, then I jumped to NT 3.51 (hoping for more stability), then NT4. I was super frustrated with the lack of stability on windows at the time.

My first paying job in 1998 was helping develop "embedded" computer systems for video surveillance that ran proprietary software (built on VB3 I think??) that required Win98, so I had my fair share of time messing with that as well, though only at work. I wasn't doing any software development, it was more of an integration position; I worked on the software end (in collaboration with the software vendor), another guy worked on the hardware, then the company built custom chassis for them. Years after I left my co-worker told me every single system they sold was returned at least once for some failure/problem. That was a good laugh.

XP and later windows seemed to last longer in my experience, but I haven't seriously used windows as a daily driver anyway since about 2005 (and that was only on a work laptop; it had XP on it, I replaced the shell with "Litestep" and used a lot of cygwin, and the IT staff could never figure out how to use my system). I certainly have had Windows 7 systems that lasted longer than a decade and did not need reinstalling, however they were also never intensely used.

Waterfox: A Firefox fork that could teach Mozilla a lesson

Nate Amsden

Re: Palemoon, check. Seamonkey, check.

I used Palemoon for a while, a couple of years at least (gave some donations too). I held on to Firefox 37, I think it was, for as long as I could, and tried Waterfox at the time, but none of my extensions were compatible (this was years ago). Then I found Palemoon, and it worked with everything.

I had Palemoon as my "main" browser (excluding internal work stuff which runs on a VM), and I used Seamonkey as a dedicated browser for Logicmonitor SaaS monitoring (no real reason other than I just wanted to keep that separate from browser restarts or whatever).

Fast forward to earlier this year(?), Palemoon put out an update that killed my extensions, every single one was shot down as incompatible. Looking at my /usr/local where I keep my browsers, it was probably Palemoon 29. Palemoon 28 was working fine.

I had been using those extensions, some for over a decade without updates; they were simple and worked fine, though perhaps they were obscure. Extensions such as Live HTTP Headers (miss that one a LOT), Cloud to Butt (made me laugh, miss that a lot too), Old location bar, Prefbar (miss it), RememberPass, Remove It Permanently (miss a lot), Save Link in folder, Tab utilities fixed, Zoom page. Looking at Palemoon 29 now, it lists all of those as incompatible (with the only option given being to remove them). At the time I tried finding replacements; I found some for other extensions but could not find any for things like Prefbar or Live HTTP Headers, for example.

Anyway I came across the forum post which highlighted the changes. So Palemoon removed the last reasons I had for wanting to use it (earlier they removed the cookie management that Firefox had withdrawn years earlier, I had more than 10k sites in my firefox/palemoon sqlite cookie permissions db). I understand the reasoning, I'm not mad at them. Just sad to see an end of an era for browsing for me.

So at least for the time being I went back to Firefox ESR and the only addons I have are Tab session manager, ublock origin(never used it till recently), and zoom page. It's certainly a lot faster than Palemoon though I'd take my old extensions back over the performance any day.

I now run 2 firefox instances in linux in different user accounts; one is dedicated to outlook web access and atlassian products for work, the other is my regular browser (I keep the UIs/themes different to tell them apart, and they run on different virtual desktops, of which I use 16). Then I still have seamonkey, AND still have another firefox ESR in a windows VM for internal work stuff. I switched from google search to bing several years ago (just because), and noticed when using bing and outlook web access it kept me logged into bing, which I did not want (if I logged out it would log me in again). So I introduced the new firefox instance (on a different account, run via sudo) dedicated to OWA and similar work things. Configuring pulse audio to work with these firefox instances running under sudo wasn't easy (I had to configure PA for network access, which isn't the default; many head scratching moments at the time trying to figure out why audio wasn't working). Not that I need audio often in my browser, just a few times a week at most.

I have read bad things about recent firefox UI changes (those haven't hit ESR yet I don't think), but maybe I will switch again when they do.

My laptop has 48G of ram; I shouldn't need more than 16G, and am currently using 12 (including a VM that is allowed to use 10GB, though the guest is using 4GB). I was "forced" to upgrade to 32G 2-3 years ago because of newer linux swapping bugs which would cause it to swap like crazy when I still had a few gigs of ram left. The system would be unresponsive (even with SSD). My older laptop did the same workload with 8G of ram that the newer one needed 16G for, just to account for newer versions of software. Upgraded again to 48 for no reason, and will probably upgrade to 64GB once prices come down a bit, again for no reason.

Side note on linux swapping/memory issues: I have been tracking what are to me massive memory leaks in linux 5.4 and 5.8 (Ubuntu 20) vs 4.4 (Ubuntu 16) this year. Really annoying. I went as far as to install the Ubuntu 16 kernel on a few test Ubuntu 20 systems so the ONLY thing different was the kernel for the exact same workload (systems behind a load balancer), and the memory leaks slowly over time, eventually swapping, then I have to reboot. I sort of gave up trying to find a real solution, it's just easier to reboot every once in a while. The leaks are in the kernfs_node_cache and buffer_head caches (flushing the caches has no effect on the overall trend). I've been using linux for just about 25 years and have never come across this kind of situation before. I see the same trend on another system running backups over NFS; I more than tripled the memory from 2G to 6G and it still swaps every night (I have crons that clear the swap every hour and flush buffers). It never swapped on linux 4.4 or earlier going back the past 6 years.

I read Waterfox was acquired by an ad company, or something else kind of shady, I don't recall offhand; maybe that info was wrong, but it kind of surprised me when I read that post. I never used it for more than 10 minutes years ago since it didn't work with any of my extensions at the time.

Cisco requires COVID-19 shots for all US staff – even remote workers

Nate Amsden

how to prove it

Just did a search for the text "prove", didn't see anyone else asking or answering this Q. But I'm wondering, how do they prove they are vaccinated? I read a post recently saying the vaccine mandates for Los Angeles county were pretty useless because everyone they knew that was anti-vax had a fake vaccine card. I personally got my vaccine through my regular medical group so I'm sure they have a solid record of it. But many vaccinations were given... off the books? I mean in parking lots and stuff. Sure, people got a vaccine card with stickers or whatever, but apparently those are very easy to fake. Not only that, those vaccine cards are fragile and easily lost. I'm generally quite organized so I don't anticipate any issues with my card, but many people are not. Some places accept digital pictures of the cards, probably even easier to fake that.

Some of the anti vax folks probably would not be willing to even get a fake card they are so against being associated with such a thing I bet.

Not that I care either way, just curious if someone knew of a verification measure other than looking at (possibly fake) vaccine cards, since it doesn't seem possible otherwise. Obviously vaccinated people can still get infected, so if you get covid it doesn't mean you weren't vaccinated already.

I haven't been to any locations that asked to see my card, though I don't go to many locations anyway (covid or not)

Tight squeeze: Dell shrinks PowerEdge tower server from 117 grapefruit to 74 grapefruit

Nate Amsden

has dell decided to do away with LCDs on servers?

Always thought that it was super creative on Dell's part to put real LCD screens on their servers. I bought a R230 a few years ago for personal use, put it in a colo, and had the LCD screen just print my name and phone number on it. I think it could do a bunch of other stuff like show diagnostic codes and stuff. Bought a refurb R240 not long ago and was sad to see it had no LCD. At first I thought it was probably an optional component and just not included in the build I got. But as far as I could tell there was no option for LCD on R240.

Looking at the R250 and even R350 at least in the technical guides there is no mention of LCD and the pictures show no LCD either. Though it does look like the R450 has an LCD option.

Running a recent Apache web server version? You probably need to patch it. Now

Nate Amsden

probably don't need to patch

Would be surprised if more than 0.01% of apache servers out there run that latest(affected) version. Ubuntu 20 for example runs 2.4.41. Can't remember the last time I felt a need to upgrade apache(as in to get some feature or specific fix for an issue I had), I mean it's done everything I need going back to what was it 1.3 version or maybe even earlier. Last time I built apache from source was probably late 90s.

Supply chain pain: Cisco's base price structure moving north from November

Nate Amsden

supply chains could soon crash

There have been some reports (won't link directly in case that upsets el reg, easy to do a search on it though) that the millions of workers (I think I read something like 60 million across the industries) who work in the supply chains are near a breaking point, some having been unable to leave their ships in 18 months, others having to get vaccinated 2-3-4x with different vaccines for different countries' rules. Lack of pay doesn't seem to help either.

VMware to kill SD cards and USB drives as vSphere boot options

Nate Amsden

Re: Poor training....

apparently writes aren't the only issue

https://kb.vmware.com/s/article/2149257

"High frequency of read operations on VMware Tools image may cause SD card corruption (2149257)"

dates back to 6.0 and 6.5

other issues

https://kb.vmware.com/s/article/83376 Connection to the /bootbank partition intermittently breaks when you use USB or SD devices

(note applies to 6.7 too with no resolution available)

https://kb.vmware.com/s/article/83963 Bootbank cannot be found at path '/bootbank' errors being seen after upgrading to ESXi 7.0 U2

probably others too just 3 that I saw in a recent thread elsewhere.

Also vmware seems to be suggesting, perhaps requiring over 100GB of disk space for the boot disk, which probably factors into their decision to stop SD/USB support:

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.esxi.install.doc/GUID-DEB8086A-306B-4239-BF76-E354679202FC.html

* A local disk of 138 GB or larger. The disk contains the boot partition, ESX-OSData volume and a VMFS datastore.

* A device that supports the minimum of 128 Terabytes Written (TBW).

* A device that delivers at least 100 MB/s of sequential write speed.

* To provide resiliency in case of device failure, a RAID 1 mirrored device is recommended.

Not sure if there are any USB or SD cards that have 128TBW life spans and 100MB/sec sequential write speed.

I have yet to touch vsphere 7 myself, maybe next year.

Nate Amsden

Almost never liked the thought of usb/sd boot

I remember back when ESXi first came out and everyone was touting SD card and usb drive booting. Servers started coming with internal(??) SD card slots and stuff. Company I was at at the time deployed some using USB sticks I think and had failures pretty quick(had a failure within 4-6 months). At that point I realized I really didn't like the thought of the boot device for a $10-30k+ server being reliant upon such a cheap piece of crap for a boot drive.

I looked a few times but could never find reviews or rankings of higher endurance usb drives/sd cards (perhaps that changed in recent years). HP (and probably others too) came out with a dual micro SD(?) USB stick at one point, I inherited 4 servers that ran that, and went through the associated recall (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05369827). Add to that as far as I could tell it was not possible to tell the status of the individual SD cards, you'd only know if both failed. Dell has a BOSS(?) card that sits in a PCI slot I think that uses NVMe drives, sounds pretty neat.

I realize it worked fine for many people for many years. Past 10 years all of my hosts have been fibre channel boot from SAN. Except my personal esxi hosts which use local SSD storage.

Got enterprise workstations and hope to run Windows 11? Survey says: You lose. Over half the gear's not fit for it

Nate Amsden

Re: "an upgrade will have to happen in the coming months or years"

A good reason to change the version is breaking changes like these new hardware requirements. Imagine how many more would be upset if they couldn't upgrade to the next Win10 release because so much hardware was retired from support (which has happened to some users over the years at some points, I recall articles popping up), though obviously not at this scale of retired hardware.

hopefully Win10 from here on out behaves more like LTSC as in minimal feature changes, as MS shifts to focus on Win11.

With just over two weeks to go, Microsoft punts Windows 11 to Release Preview

Nate Amsden

Should users care much?

I mean, why would someone really WANT to upgrade to Windows 11? Windows 10 is supported till 2025 or something? Hell, there are still a lot of people on Windows 7. I've been using Windows myself since 3.0 (w/Multimedia extensions!) and back in the 90s (HP DOS 4 before that) there was quite a bit of excitement over windows launches. But really over the past 15+ years there hasn't been. I was an early adopter of the Windows 95 beta; obviously it was a big step up from Windows 3.1. That turned out to be not so stable (even on release) so I moved to pirated NT 3.51, then NT4, for my home desktop.

I left Windows behind as my primary OS for Linux around 1998, but still use it mostly for work on a daily basis (in a VM). I also manage a few windows server 2012R2 VMs (gasp! it's not Server 2019! but they work and still get updates! oh, and one 2008R2 system, no updates there but it works fine) on top of the ~700 Linux VMs.

Point is at the end of the day Windows 10 isn't going anywhere. Software will still work for it for a long time, hell vendors are still providing support for Windows 7, though not sure how much longer. AV software, web browsers etc still getting regular updates at least. Expect the same for Win10 through at LEAST 2027 I'd say. It's 2021, your hardware will be realllllllllly old by 2027 if it's not new enough to be certified by Win11 today (I assume none of mine is though haven't bothered to check since I have no need to).

Stop stressing out about these hardware requirements, do yourself a favor don't worry about Win11 for another 2-3 years at least.

Java 17 arrives with long-term support: What's new, and is it falling behind Kotlin?

Nate Amsden

stupid DNS cache forever default not fixed

I shouldn't be surprised but still, I am. I first encountered this probably back in the Java 2 era, before 2005, with our own app stack. Then I was hit by this behavior again 5 years later, involving one of the largest credit card payment processors (at another job; not at the payment processor, I was at a company that was a customer of the processor).

By default Java caches positive DNS responses forever, until the JVM is restarted. The justification in the java.security file (same as it is in JDK 17, I just downloaded it to check on Linux) is that caching forever protects against cache poisoning attacks. Anyone that knows DNS knows you really have no way of telling whether a given DNS response is poisoned or not (for any that mention DNSSEC, note that doesn't apply to client DNS requests). What if the JVM caches the poisoned response? It also makes for trouble in clusters where JVMs may not all be restarted at the same instant, so different VMs may get different DNS responses depending on when they restarted.

There are so many problems with caching forever that it just blows my mind it is still the default configuration. Worse still, many Java customers (even large enterprises I have come across in my experience) don't know about this setting and just leave it at the default because they don't know any better. And when they do know about it, some are actually scared to change the setting, so they leave it at the default anyway.

see the setting for

#networkaddress.cache.ttl=-1

in conf/security/java.security for reference
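
For anyone who'd rather override it per application than edit the JDK's java.security file, here's a minimal sketch (my own illustration, not anything lifted from the JDK docs) of setting the security property at startup; it has to run before the JVM does its first name lookup, otherwise the original policy may already be locked in:

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheTtl {
        public static void main(String[] args) throws Exception {
            // Override the shipped default (-1 = cache successful lookups forever).
            // Values are in seconds; 0 disables positive caching entirely.
            Security.setProperty("networkaddress.cache.ttl", "60");
            // Negative (failed) lookups have their own knob.
            Security.setProperty("networkaddress.cache.negative.ttl", "10");

            // From here on, repeated lookups of the same name get re-resolved
            // roughly every 60 seconds instead of being pinned for the JVM's life.
            System.out.println(InetAddress.getByName("www.example.com"));
        }
    }

Setting it in code only covers that one application; editing the java.security file covers every JVM using that JDK install.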

It wouldn't be so bad if, given they WANTED to keep the default, they threw up a warning when the JVM starts saying HEY, take a look at this setting, you may want to change it. Then perhaps add a JVM startup option or something to suppress that warning for those that REALLY don't want to change the setting.

The biggest part of the problem is that people mostly don't know this setting is even there, and many are stupid, so they see the justification comment in the file and say hey, that makes sense, I should keep the default setting.

Or the least they could do is change the comment in the file to say the previous justification was very wrong and caching forever is bad behavior, but that they keep the default like this because they want backwards compatibility to be perfect or something, while saying that as a customer you SHOULD change this value.

(Note I haven't fired up anything with Java 17 so they may very well have inserted a warning but I kinda doubt it)

AT&T Alien Labs warns of 'zero or low detection' for TeamTNT's latest malware bundle

Nate Amsden

monitor CPU usage too

Several years ago I was called in to help with another team's compromised WordPress website. At the time it was infected with a bitcoin mining thing. The only reason it was detected was that it shot CPU usage through the roof, which then tripped some alarms I think. Of course the system was lacking many updates, which led to the compromise, but the point being that when crypto miners get installed I would expect CPU usage to go way up.

On all of my ~800 VMs I have monitors that say if CPU usage (as measured by vCenter, monitored by LogicMonitor) is greater than 75% for an hour then send a WARNING alert, and if greater than 90% for an hour send a CRITICAL alert (neither alert goes to pagers, just to the PagerDuty alert dashboard and email). Not to track cryptocurrency miners of course, just general usage indicating something could be wrong with the software running on the system if it is using that much CPU for that long.
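
Just to illustrate the rule (this is a toy sketch of the threshold logic only, not the LogicMonitor configuration itself; I'm assuming an hour's worth of per-VM CPU samples from vCenter is already collected):

    import java.util.List;

    public class CpuAlertCheck {
        enum Severity { NONE, WARNING, CRITICAL }

        // hourOfSamples: CPU utilisation percentages for one VM over the last hour,
        // e.g. one value per polling interval. Mirrors the sustained 75%/90% rule.
        static Severity evaluate(List<Double> hourOfSamples) {
            if (hourOfSamples.isEmpty()) return Severity.NONE;
            if (hourOfSamples.stream().allMatch(v -> v > 90.0)) return Severity.CRITICAL;
            if (hourOfSamples.stream().allMatch(v -> v > 75.0)) return Severity.WARNING;
            return Severity.NONE;
        }

        public static void main(String[] args) {
            System.out.println(evaluate(List.of(96.0, 93.5, 91.2))); // CRITICAL
            System.out.println(evaluate(List.of(80.0, 78.0, 82.0))); // WARNING
            System.out.println(evaluate(List.of(35.0, 40.0, 30.0))); // NONE
        }
    }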

It's pretty rare on my systems, at least, that the CPU would be pegged for more than an hour for a normally functioning application (I guess if that were normal then it would signal to me that the system needs more CPU resources), though it does happen from time to time.

ProtonMail deletes 'we don't log your IP' boast from website after French climate activist reportedly arrested

Nate Amsden

"If they weren't logging those IP addresses and connection strings, there was nothing to seize."

It sounds like they have the ability to log based on user account. So perhaps while they don't log normally, if a request comes in requiring them to get the IP, they can flip a flag in their code/config to start logging for that particular user account, and then, assuming the user logs in again, they have the information.

If you are that paranoid about hiding your IP etc. then you shouldn't be trusting a single provider like this; you should be routing traffic over multiple different places to further obscure your information, and not waiting for some news event like this to start doing it. Also, of course, use a dedicated browser that is not used for anything else except that service, and if you're even more paranoid perhaps use a dedicated VM with that browser.

Seeing the anonymous relay service they offer mentioned in the article reminds me of my early internet days using the anon.penet.fi (I think it was?) email relay; it sometimes took days for email to be processed through that. I have been hosting my own personal email (around 350 different addresses for different purposes at the moment) since about 1997, along with web, DNS and anything else I want. Though of course doing that is not for 99.999% of people out there.

VMware shreds planned support for 'cheese grater' Mac Pro

Nate Amsden

Re: Why were they even thinking about ESXi on Mac Pro?

Can't tell, but I think you missed the obvious use case of being able to legally run lots of OS X VMs on top of the system, since it is really beefy hardware vs the other options. Though I have read reports that OS X runs like crap in a VM (crap as in the user interface is very slow etc.); I am not sure if that is because they were using unsupported hardware or if the OS just doesn't play well without, say, mapping GPUs to the VMs (something ESXi can do but as far as I know the desktop/workstation VMware products cannot; personally I have never had a need to map a GPU to a VM, though I understand that VDI is the main purpose for that).

What I'm curious about (not that I have a need for it) is: does ESXi work on it at all? Is this them just saying they won't support it, or were they actually getting drivers together etc. for it to work in the first place? There seem to be lots of folks out there running unsupported ESXi configurations without formal support who are happy. Hell, even the free ESXi doesn't include support.

Given it is a Xeon system I would be kind of surprised if it weren't possible to build a configuration that works with ESXi, since you can add in (I assume) another NIC or something in case the on-board one is not supported, given it has a bunch of PCIe slots.

Component shortages: HPE pushes up some hardware prices and as if by magic, reports 'record' gross margin in Q3

Nate Amsden

Re: Silicon shortage

Edge case for sure, but I was looking into updating end-of-life dates for equipment for the company I work for. I had thought some of our Extreme Networks X670V 10G switches were going end of life next year, but I read the end-of-life notice wrong (one specific version of the switch goes EOL next year). Turns out they are going software end of life in 2023, and hardware end of life in 2026. Which just blows my mind considering we bought some of them in November 2011 (blows my mind as in I expected a shorter life span). I knew they were new at that time, but I looked into it further and it turns out we bought them just 4 months after they hit the market. So just lucky timing to some extent I guess. I do have several 1G Extreme switches going EOL mid 2022, some of which will hit 10 years of operation in mid December.

I know 10G to many is old school, but at this rate I could run 10G for another decade and not need to upgrade. 40G uplinks from the 10G switches, super low utilization. Dedicated fibre channel network for storage.

By contrast, it was unlucky timing when our IT department bought a pair of Arista DCS-7050SX-72-F 10G switches several years later; they only had about 3-4 years until end of life/support at the time. I can only assume whoever sold them screwed them in that manner to get the cost down by selling an older model.

Until this company I hadn't worked at any company longer than about 3 years so never really had to be concerned about end of life. Here I am in this position for just over 10 years now, so end of life is certainly something I have to deal with now.

I have read bad stories, mainly about EMC (though I'm sure others do it too), jacking up renewal prices to encourage customers to upgrade, to the point where renewal support can cost almost as much as buying new. I haven't experienced that with HP at least; going into the 5th-7th year of support on several 3PAR systems, support costs have been fairly consistent. We do have an EMC Isilon system as well (very small); we bought 3 years of support originally but moved to hardware-only 3rd party support in year 4. The EMC renewal cost wasn't too bad at that point, we just needed to cut some costs.

Cloudflare says Intel is not inside its next-gen servers – Ice Lake melted its energy budget

Nate Amsden

Re: Is Intel the new IBM?

AMD thought ARM was the next big thing for servers too at one point

https://www.theregister.com/2016/01/14/amd_arm_seattle_launch/

Cloudflare and ARM

https://www.theregister.com/2017/11/08/qualcomm_centriq_2400/

Seems like every time ARM gets close to servers its opportunity is killed. I guess one exception is for those that are really building their own systems and designing their own chips (and such systems aren't available for anyone else to purchase). I don't really have much hope for ARM on servers in the general market.

It seemed to me that as hard as it was for x86 to scale down in power (for mobile), it was just as hard for ARM to scale up in performance (for servers), to the point where the cost/power advantage didn't really make a big selling point for ARM. I'm guessing Qualcomm saw the margins Intel was getting and was hoping they could do similar with their server chips, but pulled the plug when they realized they would not get those margins (most likely due to competition, and the cut-throat nature that the big hyperscalers play by).

Companies tried to work around that to some degree for a while with micro server designs, but those never really went anywhere (AMD bought one such vendor, SeaMicro, which looked like a really neat system at the time, and HP still has their Moonshot system, though it's been 4-5+ years since I heard them talk about it).

IBM Cloud took the evening off – 23 services were hard to provision for eight hours

Nate Amsden

a relief

When I moved my current org out of public cloud a decade ago the costs were the primary selling point to management (ROI of 8 months according to my manager at the time). But for me it was not only costs, it was less headache having complete control over the infrastructure (aside from power and internet; our internet provider has a 100% SLA for their network, we haven't suffered any power issues at our main data center since we moved in, a data center we used in Amsterdam had minor power issues a few years ago but everything with dual power supplies stayed online, and we don't have anything hosted in Europe anymore). Despite the 100% SLA on the network there are of course caveats; things like DDoS-related outages are not covered (fortunately they are super rare, we've never been attacked but have on rare occasion been collateral damage).

Random outages, performance problems and general WTF moments seemed to be endless points of stress when using public cloud, in part because they are always messing with it. And those were just the "small scale" issues/anxiety that don't make news stories like this one. Things like this are hard to put a $ figure on. You can work around some of it by making your app super resilient, but as a recent cloud/devops survey reported here on el reg, most companies have not done that (as I predicted back in 2010). It's not easy or cheap to do.

Maybe we have been lucky, I am not sure, but I can count on probably one hand the number of hard production VM server failures we've had in the past 5-6 years (oldest production server probably from 2015, oldest non-production from 2013). Zero production storage failures (oldest array online since 2014), zero networking failures (oldest networking device online since 2011). Just so damn reliable, I mean it's even exceeded my expectations (by a big margin). When a VM server fails, almost every time the VMs are auto-restarted elsewhere faster than even the alerts can come in notifying they are down. It's super rare to need manual intervention on anything during that kind of event, all state retained. No data lost (other than perhaps some in-flight transactions).

Of course, outside of a few folks this level of reliability is not well recognized (as is expected I suppose). A decent part of the success, in my opinion, is keeping things simple. The more complex you get, the more likely you hit bugs. Of course if you have really good testing you can probably catch them early, but no place I have worked at in the past 20 years had really good testing.

I recall one quote from a QA director many years ago at my first "SaaS" (before that was a term) company, who said in a meeting something along the lines of "If I had to sign off on any build going to production we'd never ship anything". Another quote from my time there, from the director of engineering during a big outage: "Guys, is there anything I can buy that would make this problem go away?" (the answer was no, and that answer didn't come from me, I was just following along). Fun times.

DevOps still 'rarely done well at scale' concludes report after a decade of research

Nate Amsden

Re: The problem with DevOps . . .

"In my experience, the main problem with DevOps is that it was conceived of by programmers who went to top-tier engineering schools and thus have a reasonably deep background in programming theory and practice and a great deal of comfort with reducing projects to a series of discrete tasks."

That's not devops, that's agile. (emphasis on reducing projects to a series of discrete tasks).

Devops I think came more from programmers who were tired of being "held back" by operations folks (such as myself perhaps, being in ops since 2003, and internal IT before that). They wanted more control over things end to end. Which, if they know what they are doing, can be effective.

More often than not (90%+) they don't know what they are doing from an ops standpoint, and so the result you get is shit (shit composed of massive complexity, tribal knowledge and huge costs associated with cloud technologies).

I literally don't even need two full hands to count the number of developers I've worked with over the past 21 years that had a really good grasp of ops. All of those that knew ops were excellent; I worked really well with them. Many other developers know they don't know operations and establish a trust relationship with the operations team, deferring to their expertise on those topics; they were great too, lots of mutual respect there. Then there are the devs that think they know it all when they don't, and just make trouble. Certainly there are ops folks out there that fall into similar categories, though in my experience (limited, I suppose, to smaller companies) the issue is much more on the dev end of things. Also, if you have ops people who don't know what they are doing, that can be a big issue too; I've run into several system and network admins over the years who are clueless at times and make it worse by not admitting to it (just like the dev end of things with ops-related stuff). I guess the point is you need someone(s) with good operations experience/knowledge (depending on your scale of course), and they are super rare.

At the same time, the number of ops people that have a really good grasp of dev stuff is low as well. But I don't typically see the ops folks getting their hands dirty on the dev end of things (or even trying to). If we're approached for an opinion or are in some meeting on design or something we can provide input, but ops folks don't generally try to dictate application design and such, that's not our thing. I remember the first development architecture meeting I attended as an ops person back in 2003. I practically fell asleep and wondered why they hired me; at the time there were so many terms I had never even heard of (enterprise Java application). Super complex app stack, the most complex I've ever dealt with even to today. I ended up being the company-wide expert at that stack, I knew everything and everyone knew it; quite a double-edged sword, I burned out hard there. But I had a good time and learned a lot too.

Devops isn't for everyone (as it is defined most recently it's not for me either); too bad so much marketing bullshit is behind it, and other things like public cloud, that many people feel compelled to try to adopt it, often with poor results. Same goes for agile as well.

Speaking of cloud usage, both the company I am at now and the previous company started out in public cloud. There was no "lift and shift", as in they had nothing to lift from; day 1 was in public cloud at both places (app/DB/server/etc. designs were set before I started in both cases). I/we did lift and shift the current company out of public cloud (including waiting many hours for mysqldumps from RDS MySQL instances) a decade ago, and as a result have been saving easily over $1M/year every year since.

The previous company's (now defunct) board didn't want to move out despite $1.6M savings in the first year alone (it had support from all levels of the company, but the CEO and CTO didn't want to fight the board). Over the years several executives have tried to push for public cloud again, as it sounds cool, but they couldn't make the math work even remotely.

Microsoft abandons semi-annual releases for Windows Server

Nate Amsden

now if only

...they'd do the same for desktop Windows. Make LTSC the default; if you really want new features, use that special version.

They have Win10 LTSC now of course, but they don't make it easy to get (I think it's unavailable outside of enterprise?) and it seems to cost about double. The versioning and editions are so confusing now too, unless you're hyper-focused on MS products (coming from someone who used to run NT 3.51 and 4 back in the mid 90s until switching to Linux full time, probably some time in '97).

Though I do still use Windows regularly (mostly in VMware), probably 10 to 15% of the time. Have one Win10 LTSC VM, will probably add another soon.

Security breaches where working from home is involved are costlier, claims IBM report

Nate Amsden

assuming they don't differentiate

Between working from home full time and working from home occasionally. The key point being that even if you are working from home only occasionally (as in you may be based in the office 5 days a week but sometimes log in to the VPN when you're not in the office to do something), the infrastructure needs to exist to provide remote access, so things are more vulnerable since those remote access devices have security issues of their own on occasion.

While several organizations are moving workers back to offices I don't think (m)any will be removing remote access entirely across their employee base.

Rackspace literally decimates workforce: One in ten staffers let go this week

Nate Amsden

Re: beginning of the end

I'd wager the beginning of the end was probably close to a decade ago. Might have been when they made their initial big investment in OpenStack. Not that OpenStack was a bad idea at the time (it had a lot of promise, though to me that promise has fallen flat over the past 5-7 years, mainly due to complexity), but it was a serious shift in technical strategy.

https://www.theregister.com/2010/07/19/nasa_rackspace_openstack/

Couldn't find the article (I think there is one) about when Rackspace essentially pulled out of OpenStack.

I was never a customer of theirs but did price their stuff out on a couple of occasions ~10 years ago; the cost never made sense. Not that public cloud is any better (actually worse in many respects, but some still eat it up because it's sexy I guess). Absolutely astonishing how much money is wasted in public cloud, it just makes me sad.

Akamai Edge DNS goes down, takes a chunk of the internet with it

Nate Amsden

took them a long time to acknowledge it

As an affected customer, it took our stuff out at about 8:34am Pacific time. I checked their status page and everything looked fine, but DNS was not, after several manual attempts to query their systems. Tried to call support, the queue was full. Tried to do support chat, was immediately disconnected (that surprised me, I expected to be put in a queue even if I was #8590283 in line). Tried to file a support ticket, internal server error. Once I saw that, I hung up the phone; obviously others were reporting the issue to their support.

Given their support systems were overwhelmed, I'm surprised they didn't manage to update the status page on their site to show an issue was going on.

They have a community support page, and that didn't get a post till about half an hour into the incident, and they didn't even get around to emailing me that there was an issue until two minutes after it recovered (9:39am for us, the email came in at 9:41am Pacific time). Same with their status page: the outage had been going for about half an hour before it was updated.

I don't mind the outage, but it would be nice if they could get their status page closer to real-time status; it should be updated within, say, 5 minutes of a major disruption like this.

If companies really cared about a CDN provider going down (because it does happen), the obvious solution is multiple providers, but not many organizations are up to doing that, though it's significantly easier than using multiple data centers or, for those in public cloud, multiple cloud providers. Same goes for DNS providers: nobody is forcing you to use a single provider. If it means that much to you then use a 2nd one (or a 3rd); again it's quite simple, but most orgs don't care enough to do it. I recall noticing about 11 years ago that Amazon was using Dynect for the first time (they were UltraDNS-only before), and my Dyn rep at the time said they signed up one Q4 after UltraDNS had a big outage. Seems like today they still use both of those providers, at least for their main domain. Meanwhile Microsoft is bold enough to rely on their own Azure DNS for their main domain.

White hats reported key Kaseya VSA flaw months ago. Ransomware outran the patch

Nate Amsden

Re: Inside info?

You failed to mention the fact that if such a critical bug were reported publicly without "responsible disclosure", yes, the bug would have been fixed faster, but it is much more likely that such a bug would be exploited even faster (faster than the bug could be fixed). I have no idea how this VSA software even works, but even if a patch was released fast, would the customers have been quick to patch? (Or are their patches applied automatically?)

You can see this in real time right now with the "PrintNightmare" stuff from MS: fumbling about releasing patches that don't fix the issue and cause other major issues (reports of people not being able to print with certain kinds of printers) etc. And in that case the disclosure of the bug, if I recall right, was an accident, with the reporter thinking it was already fixed.

It is very unfortunate though that security plays such a low priority in software development for the vast, vast majority of organizations out there. Add to that, security plays such a low priority in the operation of such software in the vast majority of organizations; just look at how many times there are reports of compromises because of some issue that had a patch released but never applied, assuming they were even aware such software was in use (if you are running an insecure VPN appliance it should be obvious, obvious meaning patches are available from the vendor, but if you have code running insecure libraries it may not be obvious). Or even worse, organizations that expose systems such as databases directly to the internet, or "cloud" file shares that are meant to be private.

I don't know what the solution is, if there even is one. The cost of security issues hasn't gotten to the breaking point where companies are willing to invest more in security, it seems, anyway.

Intel sticks another nail in the coffin of TSX with feature-disabling microcode update

Nate Amsden

what kind of workloads use/used TSX?

I am assuming probably greater than 90% of workloads never use TSX, but am curious: can anyone name an application or type of workload that did? I came across this blog post that explains what TSX is: https://software.intel.com/content/www/us/en/develop/blogs/transactional-synchronization-in-haswell.html But to my brain it doesn't give me any clues toward naming a software application that might take advantage of it.

Some kind of database? HPC maybe? media encoding? super obscure custom in house apps?

Do you want speed or security as expected? Spectre CPU defenses can cripple performance on Linux in tests

Nate Amsden

disable mitigations in OS - probably doesn't override firmware mitigations?

Most of my systems came from before the Spectre stuff, so I haven't installed the firmware updates that have those fixes in them (I've read some nasty stories about them, most recently the worst of them here: https://redd.it/nvy8ls). I have seen tons of firmware updates for HPE servers that are just updated microcode, or fixes to other microcode, implying some serious stability issues with the microcode.

I have assumed the Linux options to disable the mitigations operate only at the kernel level and are unable to "undo" microcode-level mitigations.

On top of that, on my VMware systems (ESXi 6.5) I have kept the VIB (package) for microcode on the older version since this started. The risk associated with this vulnerability is so low in my use cases (and in pretty much every use case I've dealt with in the past 25 years) that it's just not worth the downsides at this time. I can certainly understand doing otherwise if you are a service provider with no control over what your customers are doing.

I can only hope for CPU/BIOS/EFI vendors to offer an option to disable the mitigations at that level, so you can get the latest firmware with the other fixes and just disable that functionality. Probably won't happen, which is too bad, but at least I've avoided a lot of pain for myself and my org in the meantime (pain as in having VM hosts randomly crash as a result of buggy microcode).

I do have one VM host that crashes randomly, 3 times in the past year so far; the only log indicates that it loses power, sometimes 2-3 times in short succession (and there is zero chance of an actual power failure). No other failure indicated, and it's not workload related. HPE wants me to upgrade the firmware, but I don't think it's a firmware issue if dozens of other identical hosts aren't suffering the same fate. They say the behavior is similar to what they see with the buggy microcode, but that buggy microcode is not on the system. So in the meantime I just tell VMware DRS not to put more critical VMs on that host, as I don't want to replace random hardware until I have some idea of what is failing (or at least can reliably reproduce the behavior; I ran a 72-hour full burn-in after the first crash plus full hardware diagnostics and everything passed). I'm sort of assuming the circuit board between the power supplies and the rest of the system is flaking out, but not sure. The first time, it crashed so hard the iLO itself got hung up (could not log in) and I had to completely power cycle the server from the PDUs (that has personally never happened to me before); the iLO did not hang on the other two crashes. The server is probably 5 years old now.

Another question is if version "X" of microcode is installed at the firmware/BIOS/EFI level and the OS tries to install microcode "V" (older), does that work? Or does the CPU ignore it (perhaps silently)? I haven't looked into it but have been wondering that for some time now. I'm not even sure how to check the version of microcode that is in use (haven't looked into that either). Seems like something that should be tracked though, especially given microcode can come from either a system BIOS/firmware update and/or the OS itself.
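
(For what it's worth, on x86 Linux the running revision does show up as a "microcode" field in /proc/cpuinfo, one line per logical CPU; a quick sketch of reading it, purely my own illustration:)

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.TreeSet;

    public class MicrocodeRevision {
        public static void main(String[] args) throws Exception {
            // Each logical CPU reports its loaded microcode revision in
            // /proc/cpuinfo as a line like: "microcode : 0xb000038"
            Set<String> revisions = new TreeSet<>();
            for (String line : Files.readAllLines(Path.of("/proc/cpuinfo"))) {
                if (line.startsWith("microcode")) {
                    revisions.add(line.split(":", 2)[1].trim());
                }
            }
            // Normally you'd expect a single revision across all CPUs; seeing
            // more than one would itself be something worth flagging.
            System.out.println("Microcode revision(s) in use: " + revisions);
        }
    }

As for the older-version question, my understanding is the kernel's loader only applies an update whose revision is newer than what is already running, but I'd want to verify that before relying on it.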

Microsoft loves Linux so much that packages.microsoft.com has fallen and can't get up

Nate Amsden

never rely on external systems for critical repos

The post shows someone saying they have a customer experiencing a big outage because they can't download these files? Hosting your own repo, even if it is in your "cloud" account, has been a thing for more than 15 years now. I am just at a loss for words as to why people continue to depend on these external sources when they should be mirroring whatever is critical for them inside their perimeter so they have control over it. The sheer laziness of folks is just amazing. Don't get me started on the "oh just run this command that downloads a shell script, pipe it to a shell and run it as root to install the software" people. Shoot me now, please.

Debian's Cinnamon desktop maintainer quits because he thinks KDE is better now

Nate Amsden

Mate is great

Not sure when I first switched to Gnome 2; I was using AfterStep for many years in the late 90s and early 00s (on Debian), and perhaps I jumped to Gnome 2 from that, mostly on Ubuntu, maybe starting mid 00s. Then Ubuntu and the Gnome team, separately but both taking similar drugs, decided on radical changes. Fortunately there were enough people to start MATE and Mint (not sure when Mint's first version was). I jumped ship from Ubuntu to Mate 17 about a year after 10.04 LTS went end of life, and am now on Mate 20 (installed fresh last year).

It's really nice to have had a stable user interface for about the last 15 years now. Though I may only have ~5ish years left: one of the key bits of software I use with Mate is called brightside, for edge flipping on virtual desktops (I've never been multi-monitor, always virtual desktops; my regular laptop uses 16). brightside apparently isn't maintained anymore; the last version I could find was for Ubuntu 16.04.

After a couple hours of work I was able to build it cleanly on Mint 20 (Ubuntu 20) and it works fine, but several of the libraries it uses are past end of life (and I had to hack some stuff into the code/configs to get them to build), and I am certainly concerned that 5 years down the road when I upgrade again it won't work anymore. Then there's the whole Wayland thing, what will that be like; I'm guessing brightside from 2014, using the X11 protocol, probably won't work too well on that. I've been using edge flipping since my days with AfterStep, which was/is a master at virtual desktops (WindowMaker too I'm sure; I used AfterStep at the time to be different I guess, and later used LiteStep on WinXP for my work system in the mid 00s). Mate works fine without brightside, but I switch virtual desktops often, sometimes several times a minute, so having that functionality is critical. I saw some alternatives before I went down the road of building brightside myself, but none seemed to compare from what I recall.

The only annoying bit is that the marco window manager continuously loses the "mouse over activation" ability (another critical bit for me); this started a few years ago and I was hoping it would be fixed in Mate 20, but it is not. I have a little button on the screen that I press to reset marco (doesn't cause any data loss) and it works again for a random amount of time.

I was never into Cinnamon or Gnome 3. I have had Gnome 3 on my home Debian "server" (it only has a GUI to show either calm videos in a loop in VLC or a slideshow) over the years and think it would be too painful to use day to day.

I used KDE back in the 90s for a while; I remember building it and Qt from source many times, pre-1.0 stuff. Not sure why I stopped using it.

Anyways thanks to MATE/Mint folks..going to go donate again now.

Cloudflare network outage disrupts Discord, Shopify

Nate Amsden

Re: CDN useless

Not sure where you are coming from, but it has been common practice for CDNs to terminate SSL for over a decade now (probably much longer). Most (maybe all) of the major CDNs are PCI compliant as well (I contacted several last year as I was expecting to have to jump CDNs again; our previous CDN went out of business early last year). So they have visibility into everything traversing them from a protocol perspective anyway. Even if you encrypt individual files to transfer, they can still be cached in encrypted form, since the CDN will see the raw data as it decrypts the SSL/TLS on top.

I really can't imagine many customers out there not trusting their CDNs to decrypt the traffic. Servers are faster, but in my experience at least servers have rarely been the bottleneck when it comes to traffic; servers are eaten up by app transactions. It's origin bandwidth and latency that CDNs help with most in the simple use cases. Not too uncommon to get more than a 90% reduction in origin bandwidth with a CDN.

But they can do more if your developers are willing to leverage them; one useful function several provide is automatic image resizing. I tried to get the devs to use it at the org I am at for years but they never wanted to; instead they wanted to store ~15 copies of each image (pre-generated in advance regardless of whether any of those copies would ever get used) in different resolutions. Just a waste of resources, made worse by seeing some images on the site be super-sized only to be reduced dynamically by image tags in the browser.

CDNs do offer nice protection from (D)DoS attacks as well, at least some varieties of them, just because they have such massive capacity.

CDNs certainly can go down, so for those for whom it is super critical that their CDN does not go down, use multiple CDNs, either with dumb round-robin DNS or with an intelligent DNS provider that can do health checks on the backend and automatically re-publish DNS entries to point to an alternate provider. (In the past I was at a company that did this, not with a CDN but with our own multiple backend systems (the app stack was entirely transactional, no static content, nothing could be cached), and we kept the TTLs to 60s or less, I believe, using an anycast DNS provider; this was ~11 years ago. Prior to that they used BGP to fail over between sites, but that was quite problematic so we changed to DNS failover.)
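
The health-check-and-republish idea is conceptually simple; here's a rough sketch of the checking half (my own illustration; the hostnames are made up, and the actual record update would go through whatever API your DNS provider exposes, which is left as a placeholder comment here):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;

    public class CdnFailoverCheck {
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();

        // Hypothetical endpoints on two different CDN providers, in order of preference.
        private static final List<String> CANDIDATES = List.of(
                "https://static.cdn-a.example.com/healthcheck",
                "https://static.cdn-b.example.com/healthcheck");

        static boolean healthy(String url) {
            try {
                HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(3)).GET().build();
                return CLIENT.send(req, HttpResponse.BodyHandlers.discarding())
                             .statusCode() == 200;
            } catch (Exception e) {
                return false; // timeouts, connection failures etc. count as unhealthy
            }
        }

        public static void main(String[] args) {
            for (String url : CANDIDATES) {
                if (healthy(url)) {
                    // This is where a call to the DNS provider's API would go,
                    // e.g. re-publishing a low-TTL (60s) CNAME to point at this provider.
                    System.out.println("Would publish CNAME pointing at: " + url);
                    return;
                }
            }
            System.out.println("No healthy endpoint found; leaving DNS unchanged.");
        }
    }

The intelligent DNS providers do all of this for you, which is the easier route; the point is only that the logic isn't exotic.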

Linux 5.13 hits rc5, isn’t yet calm, Linus Torvalds is only mildly perturbed

Nate Amsden

Re: Still brickin'...

Very confused... as a Debian user (well, until switching to Devuan) since 1998: if you are not familiar with Linux and are not looking to get familiar with it, Debian is nowhere near the top of the list of distros you should use. Really, only more technically inclined people would have even heard of it.

Even I ran Ubuntu on my laptops for several years, until 10.04 went EOL, then switched to Mint. I run Debian/Devuan on my personal servers (and have about 650-700 Ubuntu servers for work).

So you really set yourself up for failure. That is, unless you were looking to dig in, learn about things and fix it, or find compatible hardware, which it didn't seem like you were in the mood for.

Myself, when I first set up Linux back in 1996 I chose Slackware (3.0 I think?) specifically because it was more involved to use than Red Hat (the most common distro at the time) and I wanted to get into the deep end. And I did, downloading and compiling tons of things from source over the early years, whether it was the kernel, libc, glibc, X11, KDE, Gnome etc... learned a lot. I don't do that too much anymore though, and I stay far away from bleeding edge kernels. The last time I installed a kernel directly from upstream was in the 2.2.x days (back when there were "stable" and "unstable" branches of the kernel; once that stopped, I stopped toying with things at that level).

Hell, I just started trying to dig into why there seem to be some major new memory leaks in Linux 5.4 and 5.8 (Ubuntu 20.04) that didn't exist in Linux 4.4 (Ubuntu 16.04). It's the first time I've really looked at /proc/slabinfo and /proc/zoneinfo in 20+ years of Linux usage; hopefully something useful comes of it. I have never noticed this kind of memory leak in the kernel before; my use cases are very typical, nothing extreme, so I don't encounter problems often.
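
If anyone else ends up staring at /proc/slabinfo for the first time, this is roughly the kind of thing I've been doing by hand; a little sketch (my own, and note the file is normally only readable by root; slabtop does this properly) that approximates which slab caches are eating the most memory:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.List;

    public class SlabTop {
        public static void main(String[] args) throws Exception {
            // /proc/slabinfo: two header lines, then one line per cache:
            // name active_objs num_objs objsize objperslab pagesperslab : tunables ...
            List<String> lines = Files.readAllLines(Path.of("/proc/slabinfo"));
            lines.stream()
                 .skip(2)
                 .map(l -> l.trim().split("\\s+"))
                 .sorted(Comparator.comparingLong(
                         (String[] f) -> Long.parseLong(f[2]) * Long.parseLong(f[3]))
                         .reversed())
                 .limit(15)
                 .forEach(f -> System.out.printf("%-28s ~%,d KiB%n",
                         f[0], Long.parseLong(f[2]) * Long.parseLong(f[3]) / 1024));
        }
    }

num_objs * objsize is only an approximation of the real footprint, but it's usually enough to see whether something like dentry or a driver's cache is growing without bound between kernel versions.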

AWS Free Tier, where's your spending limit? 'I thought I deleted everything but I have been charged $200'

Nate Amsden

downhill

It's getting worse? Really? Wow.

Quick story: back in 2010/2011 I worked for a small startup in Seattle. The CEO's brother was the head of Amazon's cloud (now the CEO of Amazon, I guess). I met with him along with my small team at the time and gave them our list of complaints, and their response was basically "yeah, we know that is a problem and we are working on it" (my manager at the time said it was their typical response). On paper the startup had a bill upwards of $500k/mo with them. I don't know how much, if any, was forgiven on the backend given the close relations with the executives (though they did direct us to cut spending by as much as we could; we got it down to maybe $250k(?) per month, so ~$3M/year. I actually pushed a project to move them out of the cloud which had a ~6 month ROI, but the company board didn't like it; despite all management including the CTO and CEO being on board, they didn't want to fight the board for that).

Anyway, the core part of the story: my director (a new guy after the original manager left) had a history of working AT Amazon for more than a decade. Everyone at the company (especially me) hated their cloud. Non-stop problems, outages, lies, you name it. So my director reached out to their support (keep in mind they were just a few miles away) and said HEY, we spend a lot of money with you, have a lot of executive tie-ins between us, and we're in Seattle just like you are. Everyone here hates your cloud. We must be doing something wrong, maybe many things wrong. Can you come on site and help us out?

Their answer? No. Not their model, tough shit. Your problem.

I really struggle to think of any other vendor on the planet where, if you are spending half a million dollars a month and call to complain and ask for help, they wouldn't have someone on a plane (if required) the same or next day without question. I remember Oracle flying on site to one of my employers to help diagnose an issue. I recall EMC was a couple of hours away from flying someone on site to that same company to fix another issue (which ended up not being an EMC issue at all but a bug in a script the storage person wrote; I remember that call with EMC, they were practically panicking to get our processes going again after said storage engineer fucked up the script and went on vacation immediately after). That company spent a FRACTION of the $ per month on that stuff.

Amazon told us to fuck off. Kind of needless to say, my director (again, he worked at Amazon for 10+ years and we had many ex-Amazon employees working there) was quite surprised at their response.

I left not long after, and the company I have been at since (hired by my first manager from the previous startup) has been saving over $1M/year by moving out of Amazon's cloud (the bill was over $100k/mo in late 2011/early 2012 for an app stack that launched from day 1 in their cloud, and we've grown tons since). So easily $10 million in savings over the past ~9.5 years or so since we moved out. Executives have come and gone and tried to pitch cloud again, but they could never come close to making the costs work.

I have read over the years that their support has improved, so quite possibly the support response today would not be what it was for us back then for that kind of customer. But seeing your comment reminded me of this experience.

VMware reveals critical vCenter hole it says ‘needs to be considered at once’

Nate Amsden

Re: Hey now

Yes, sorry, forgot to mention HA. vCenter HA's value is questionable to me; it has its own share of issues and the failover times are absolutely terrible (for my simple setups it probably takes a good 6 minutes; I understand why it takes that long, due to the design of the app HA is sort of a bolt-on thing instead of a design thing). Then there are the times when you have to destroy HA to upgrade, with schema changes and stuff. But I hope it is better than nothing... sometimes I wonder though.

Nate Amsden

Re: Hey now

As a Linux user since 1996, count me in the group that really misses the .NET client. I run all my vCenter stuff in VMware Workstation running Windows anyway (Linux host OS). I held onto vCenter 5.5 for as long as I could.

Side note: I am installing this on one of my 6.7 vCenter setups and the build number doesn't match. The ISO is VMware-vCenter-Server-Appliance-6.7.0.48000-18010531-patch-FP.iso and the actual build after installation is 18010599 (but it also says 48000 on the login screen) from the command "vpxd -v". Don't recall ever seeing a mismatch like this before myself.

Cisco discloses self-sabotaging SSD bug that causes rolling outages for some Firepower appliances

Nate Amsden

when might this end?

Seems like we have been getting reports for the last ~5 years or more about SSD firmware bugs that brick drives after X number of days, from a wide range of manufacturers.

For me these firmware bricking bugs are the biggest concern I have with SSDs on critical systems. Fortunately I have never been impacted (as in had a drive fail as a result of one of these) yet, but I have read many reports over the years from others who have.

Even the worst hard disks I ever used (IBM 75GXP back in ~2000, and yes I was part of the lawsuit, for a short time anyway) did not fail like this. I mean, you could literally have a dozen SSDs fail at exactly the same time because of this. It's quite horrifying.

I have a critical enterprise all-flash array running since late 2014 with no plans to retire it, all updates applied (and no expectation of any more firmware updates being made for these drives). The oldest drives are down to 89% endurance left, so in theory, endurance-wise, they could probably go another 20-30 years, though I don't plan to keep the system active beyond say 2026, assuming I'm still at the company etc.