* Posts by Nate Amsden

2438 publicly visible posts • joined 19 Jun 2007

Bill shock? The red ink of web services doesn’t come out of the blue

Nate Amsden

The last company I was at was a greenfield cloud thing. They had no app stacks, everything was brand new. Their existing technology was outsourced to a company that did everything from software dev to hosting and support. At one point, before I started, the company felt they had outgrown that outsourced provider and wanted their own tech team to build their own app stack. So they hired a CTO, he built a team, and they started building the new software stack.

He hired a former manager of mine who had hired me at the previous company; I worked with him only a couple of months but that was enough I guess. That previous company was hosted in the Amazon cloud (also greenfield). This manager saw the pitfalls of that and wanted me at the new company mainly to move them OUT of the cloud (they had yet to actually launch for production).

They launched production in Sept 2011 (I joined May 2011), after doing many weeks of their best efforts at performance/scale testing (I was not involved in any of that part). All of those results were thrown in the trash after a couple of weeks and the knobs got turned to 11 to keep up with the massive traffic. Costs skyrocketed as well, as did annoying problems with the cloud. We started ordering equipment for our small colo (2 racks, each roughly half populated initially) in early Nov 2011, installed it in mid Dec 2011, and then moved out of Amazon to those two racks in early Feb 2012 (I was a bit worried as there was a manufacturing flaw in our 10Gig Qlogic NICs that was yet to be solved; it ended up not causing any impacting issues though). I repeated a similar process for their EU business, which had to be hosted in the Netherlands, moving them out in July 2012 to an even smaller infrastructure, probably about half a rack at the time. In both cases, the equipment was at a proper co-location facility, not in a server room at an office.

The project was pitched by my manager as having a 7-8 month ROI, and the CTO got on board. It wasn't easy convincing the board but they went with it. The project was a huge success. I dug up the email the CTO sent back in 2012, and sent it to the company chat on the 10th anniversary last year. He said in part "[..] In day 1, it reduced the slowest (3+ sec) Commerce requests by 30%. In addition, it reduces costs by 50% and will pay for itself within the year."

I believe we saved in excess of $12M in that decade of not being hosted in cloud (especially considering the growth during those years). Meanwhile we had better performance, scalability, reliability, and security. The last/only data center failure I've experienced was in 2006 or 2007, Fisher Plaza in Seattle. I moved the company I was at out of there quite quickly after that (they were already there when I started). Remember that cloud data centers are built to fail (a term I started using in 2011), meaning they are lower tier facilities, which is cheaper for them and is a fine model at high scale, but you have to have more resilient apps or be better prepared for failure than in a typical enterprise on-prem situation.

So count me as someone who disagrees, greenfield cloud is rarely the best option.

Basecamp details 'obscene' $3.2 million bill that caused it to quit the cloud

Nate Amsden

Re: Hiring impact

That is very interesting, and unfortunate for the customers. Sounds like that is not a real SaaS stack? Perhaps some hacked together stuff operated as a managed service?

I would not expect that in a SaaS environment a customer would even be able to look at the underlying infrastructure metrics or availability; it's just not exposed to them. I know I got frustrated using IaaS years ago, because not enough infrastructure data was available to me.

Nate Amsden

Getting good at storing files doesn't have to be Basecamp's business.

People seem to jump to one extreme or the other, either you build everything yourself or you use a public cloud, and I just don't understand why. There is a massive chasm of space in between those two options in the form of packaged solutions from vendors like HPE, Dell, and others. Many different tiers of hardware, software and support.

Nate Amsden

8PB is a lot, but it's not really a lot for object storage. The HPE Apollo 4510 is a 4U server that can have up to 960TB of raw storage (so ~10PB per rack, assuming your facility supports that much power per rack). Depending on performance needs, 60 x 16TB drives may not be enough speed for 960TB by itself; you'd probably want some flash in front of that (handled by the object storage system). Of course you would not be running any RAID on this, data protection would be handled at the object layer. Large object storage probably starts in the 100s of PB, or exabytes.
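For anyone who wants to sanity-check that, a rough back-of-envelope sketch using the ~960TB-per-4U figure above; the usable fraction and rack-unit budget are assumptions, not vendor numbers:

```python
import math

# Rough capacity math for the 8PB example above (all figures approximate).
RAW_TB_PER_NODE = 960        # 4U node, 60 x 16TB drives (Apollo 4510-class)
NODE_RACK_UNITS = 4
USABLE_RACK_UNITS = 40       # assumption: leave a little room for switches
USABLE_FRACTION = 0.67       # assumption: erasure coding / replication overhead

target_tb = 8 * 1000         # 8PB target, decimal TB

nodes_needed = math.ceil(target_tb / (RAW_TB_PER_NODE * USABLE_FRACTION))
racks_needed = math.ceil(nodes_needed * NODE_RACK_UNITS / USABLE_RACK_UNITS)

print(f"{nodes_needed} nodes (~{nodes_needed * RAW_TB_PER_NODE} TB raw), "
      f"roughly {racks_needed} rack(s)")
```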

There's no real need to use Ceph, which is super complex (unless you like that). Probably better to be using something like Cohesity or Scality (I have no experience with either), both available for HPE Apollo (and other hardware platforms I'm sure). There are other options as well.

I think I was told that Dropbox leveraged HPE Apollo or similar HPE gear plus object storage software when they moved out of Amazon years ago. As of 2015 Dropbox had 600PB of data according to one article I found.

I'm quite certain it would be easy to price a solution far less than S3 at 8PB scale, or even at smaller scale. You also don't need as much "protection" if you choose proper facilities to host at. Big public cloud providers cut corners on data center quality for cost savings. It makes sense at their scale, but users of that infrastructure need to take extra care in protecting their stuff. If you host it yourself you can use the same model if you want, but if you are talking about 8PB of data that can fit in a single rack (doing that would likely dramatically limit the number of providers able to support ~20kW/rack; otherwise split it into more racks), I would opt for a quality facility with N+1 power/cooling. Sure, you can mirror that data to another site as well, but there's no need to go beyond that (unless geo latency is a concern for getting bulk data to customers).
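To illustrate the kind of comparison I mean, a sketch with loudly assumed numbers (the S3 rate, hardware cost and amortisation period are placeholders, not quotes; egress and request charges are ignored entirely):

```python
# Hypothetical 8PB storage cost comparison; every constant here is an assumption.
S3_PER_GB_MONTH = 0.021          # assumed blended S3 standard rate
DATA_GB = 8 * 1000 * 1000        # 8PB in decimal GB

s3_monthly = DATA_GB * S3_PER_GB_MONTH          # storage alone, no egress/requests

ONPREM_CAPEX = 1_500_000         # assumed servers + object storage software
ONPREM_OPEX_MONTHLY = 15_000     # assumed colo space, power, support
MONTHS = 60                      # amortise over 5 years

onprem_monthly = ONPREM_CAPEX / MONTHS + ONPREM_OPEX_MONTHLY

print(f"S3 storage alone: ~${s3_monthly:,.0f}/month")
print(f"Self-hosted:      ~${onprem_monthly:,.0f}/month over {MONTHS} months")
```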

Nate Amsden

Re: Open source

You got a bunch of down votes but you are right for the most part. A lot of the early open source models were to release the source for free and then have a business around supporting it. Not everyone would sign up as a customer, but the good will from releasing the source would attract users. It worked well for several companies, and of course public cloud is taking that away from a lot of these orgs, which is unfortunate. And as El Reg has reported, several such companies have been very vocal about this situation.

Obviously in many (maybe all?) cases the license permits this usage (at the time anyway, some have introduced licensing tweaks since to prevent it), but I'm quite sure if you went back in time ~15ish years and asked the people making the products whether they anticipated this happening, they would probably say no in almost all cases (perhaps they would have adjusted their licenses if they viewed that possibility as a credible threat). At the end of the day the big difference between these cloud companies and earlier generations of "mom & pop" ISPs that were using Apache or whatever to host their sites is just massive scale.

Those licensing their code under BSD or similarly permissive licenses probably wouldn't/shouldn't care anyway.

Similarly for the GPL, a trigger of sorts for making the GPLv3 was TiVo "exploiting" a loophole in the GPLv2. So GPLv3 was made to close that hole (and perhaps others). There's even a term coined for it, Tivoization:

https://en.wikipedia.org/wiki/Tivoization

"In 2006, the Free Software Foundation (FSF) decided to combat TiVo's technical system of blocking users from running modified software. The FSF subsequently developed a new version of the GNU General Public License (Version 3) which was designed to include language which prohibited this activity."

Nate Amsden

Re: Hiring impact

I think you are incorrectly confusing SaaS and IaaS in your statement.

Mom and Pop shops that have minimal IT needs will likely have almost zero IaaS, because they can't manage it. IaaS (done right) IMO requires more expertise than on prem, unless you have a fully managed IaaS provider. But the major players don't really help you with recovery in the event of failure; it's on the customer to figure that out. Vs on prem with VMware, for example, if a server fails the VMs move to another server, and if the storage has a controller failure or a disk failure there is automatic redundancy. That doesn't protect against all situations of course, but far more than public cloud does out of the box. If a Mom & Pop shop has just a single server with no redundant storage etc, and that server has a failure, they can generally get it repaired/replaced with minimal to no data loss. Vs server failure in the major clouds is generally viewed as normal operations, and the recovery process is more complex.

I've been calling this model "built to fail" since 2011, meaning you have to build your apps to handle failure better than they otherwise would need to. Or at least be better prepared to recover from failure even if the apps can't do it automatically.

SaaS is a totally different story, where the expertise of course is only required in the software being used, not any of the infrastructure that runs it. Hosted email, Office, Salesforce, etc etc..

On prem certainly needs skilled staff to run things, but doing IaaS public cloud (as offered by the major players' standard offerings) right requires even more expertise (and more $$), as you can't leverage the native failover abilities of modern (as in past 20 years) IT infrastructure, nor can you rely on being able to get a broken physical server repaired (in a public cloud).

Nate Amsden

Re: Cloud Vs On-Prem

Should do the math for how bursty is bursty. At my last company I'd say they'd "burst" 10-30X sales on high events, but at the end of the day the difference between base load and max load was just a few physical servers(maybe 4).

IMO a lot of pro-cloud folks like to cite burst numbers but probably are remembering the times of dual socket, single core servers as a point of reference. At one company I was at back in 2004 we literally doubled our physical server capacity after a couple of different major software deployments. Same level of traffic, the app just got slower with the new code. I remember ordering direct from HP and having stuff shipped overnight (DL360 G2 and G3 era). Not many systems, at most maybe we ordered 10 new servers or something.

Obviously modern servers can push a whole lot in a small (and still reasonably priced) package.

A lot also like to cite "burst to cloud", but again have to be careful, I expect most transactional applications to have problems with bursting to remote facilities simply due to latency (whether the remote facility is a traditional data center or a cloud provider). You could build your app to be more suited to that but that would probably be quite a bit of extra cost (and ongoing testing), not likely to be worth while for most orgs. Or you could position your data center assets very near to your cloud provider to work around the latency issue.

Now if your apps are completely self contained, or at least fairly isolated subsystems, then it can probably work fine.

One company I was at, their front end systems were entirely self contained (no external databases of any kind). So scalability was linear. When I left in 2010 (the company is long since dead now) costs for cloud were not worth it vs co-location. Though their front end systems at the time consisted of probably at most 3 dozen physical servers (Dell R610 back then; at their peak each server could process 3,000 requests a second in Tomcat) spread across 3-4 different geo regions (for reduced latency to customers as well as failover). The standard front end site deployment was just a single rack at a colo. There was only one backend for data processing, which was about 20 half populated racks of older gear.
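To make the "do the math" point concrete, a toy sketch using the ~3,000 requests/sec per server figure above; the baseline traffic and burst multiplier are made up for illustration:

```python
import math

# Toy burst-capacity math. Per-server throughput is from the Tomcat example
# above; baseline traffic and the burst multiplier are made-up assumptions.
REQS_PER_SERVER = 3000        # requests/sec per self-contained front end server
baseline_rps = 1500           # assumed normal peak traffic
burst_multiplier = 10         # a "10-30X" style event

baseline_servers = math.ceil(baseline_rps / REQS_PER_SERVER)
burst_servers = math.ceil(baseline_rps * burst_multiplier / REQS_PER_SERVER)

print(f"baseline: {baseline_servers} server(s), {burst_multiplier}x burst: "
      f"{burst_servers} server(s), difference: {burst_servers - baseline_servers}")
```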

Nate Amsden

nice to see

Nice to see them go public about this. Not many companies are open about this kind of stuff. Another one I like to point out to people (but with far less detail, mainly just a line item in their public budget at the time) is this: https://www.geekwire.com/2014/moz-posts-2013-5-7m-loss/ . They don't call it out in the article text, but there is a graphic there showing their budget breakdown, with their cloud services taking between 21-30% of their REVENUE (cloud spend peaking at $7M), and you can see in the last year they were moving out as they had a data center line item.

I moved my last org out of cloud in early 2012, easily saved well over $12M for a small operation in the decade that followed. I had to go through justification again and again as the years went on and new management rolled in(and out) thinking cloud would save them money. They were always pretty shocked/sad to see the truth.

At the previous org I proposed moving out but the board wasn't interested (everyone else was, including the CEO and CTO, but not enough to fight for the project). They were spending upwards of $400-500k/mo at times on cloud (and having a terrible experience). I left soon after, and the company is long since dead.

You can do "on prem" very expensive and very poorly but it's far easier to do cloud very expensive and very poorly.

Cisco warns it won't fix critical flaw in small business routers despite known exploit

Nate Amsden

Re: White Box Switches and Cumulus Linux

People could have the same issue here depending on their hardware. When Nvidia bought Mellanox they killed off support for Broadcom in Cumulus Linux, which left a lot of upset users. Looks like Cumulus 4.2 was the last one to support Broadcom chips. (I have never used Cumulus/Mellanox or white box switches in general myself)

Assuming you purchased your gear before the acquisition (2020), since you said "a few years ago", hopefully your switches are not Broadcom based if you ran them with Cumulus.

Nate Amsden

Re: Time to dump Cisco

Curious, can you name any such products, especially in the networking space? I've been doing networking for about 20 years and haven't heard of any vendor/product remotely approaching 15 years of support after end of sale, at most maybe 5 years?

After long delays, Sapphire Rapids arrives, full of accelerators and superlatives

Nate Amsden

intel wasn't thinking straight

Realized this and wanted to post about it. These new chips are nice, but obviously one of the big users of the chips will be VMware customers. New VMware licensing comes in 32-core increments (and I think MS Windows Server licensing is in 8-core increments after the first 16 cores?)

Intel says (according to HP) that the "P" series Xeons are targeted at Cloud/IaaS systems. There's only 1 P series chip (at least for the DL380 Gen11), and that has 44 cores. So you're having to license 64 cores to use that processor but of course only have 44 available (and I believe a dual socket 44-core system (88 cores) would require 128 cores of VMware licensing, as they track cores per socket, not cores per system, according to their docs). At the top end of 60 cores you're paying for 64 cores of VMware licensing for 60 cores of capacity (or again 128 cores of licensing for dual socket 60 core). Intel's previous generation had 32 and 40 core processors (less than 32 would probably be a waste unless you are also having to license Windows or Oracle and are concerned about their per-core licensing).

Vs AMD, whose latest gen has CPUs with 32/64/96 cores, all of which divide quite well into 32-core licensing. On the previous generation AMD processors they have both 32 and 64 core processors.

A vSphere enterprise license (1 socket/32 cores) with 3 years of production support was about $10k last I checked (that's before adding anything like NSX, vSAN, vROPS or whatever else; for me I just use the basic hypervisor), which is a good chunk of the cost of the server.
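A quick sketch of that per-socket, 32-core-increment math; the ~$10k figure is the rough license cost mentioned above, not a quote:

```python
import math

# Per-socket core licensing rounded up to 32-core increments, as described above.
LICENSE_CORES = 32
LICENSE_COST = 10_000   # approximate cost per 32-core license with 3yr support

def licensed_cores_per_socket(cores: int) -> int:
    """Cores you pay for on one socket, rounded up to 32-core increments."""
    return math.ceil(cores / LICENSE_CORES) * LICENSE_CORES

for cores_per_socket in (44, 60, 32, 64, 96):
    sockets = 2
    paid_cores = licensed_cores_per_socket(cores_per_socket) * sockets
    cost = paid_cores // LICENSE_CORES * LICENSE_COST
    unused = paid_cores - cores_per_socket * sockets
    print(f"2 x {cores_per_socket} cores: license {paid_cores} cores "
          f"(~${cost:,}), {unused} licensed cores unused")
```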

Nate Amsden

basic comparisons

Obviously the raw specs don't indicate what the actual performance is, but I was just curious looking at the HPE DL380 Gen11 vs the DL385 Gen11. A few things that stand out to me:

Intel says that Cloud/IaaS workloads benefit from fewer cores/higher frequencies (Intel P series Xeon), which seems strange to me; of course it depends on the app. But as an ESXi user for many years I'll take more cores any day (especially if I don't have to worry about Windows/Oracle licensing). They say databases benefit from more cores (H series Xeon).

Intel's top end 8490H 60-core 350W processor runs at 1.9GHz and has 112MB of L3 cache.

AMD's top end 9654 96-core 360W processor runs at 2.4GHz and has 384MB of cache (HPE isn't specific for the DL385 on whether that is L3 or an aggregate of all caches).

On the flip side, the DL380 Gen11 has 16 memory slots per socket, vs only 12 on the DL385 Gen11 (the DL385 Gen10+ V2 has 16/socket); I can only assume the physical size of the new AMD chips is much larger than the Intel chips.

AMD's previous generation chips (in DL385Gen10+ V2) topped out at 280W, vs the new chips go to 360W (I have an earlier version of the Gen11 data sheet that actually says they draw 400W, had to double check vs the current data sheet).

Intel's previous generation chips (in DL380 Gen10+) topped out at 270W, vs the new chips at 350W.

I *think* I'd rather take the previous generation 64-core/socket DL385 Gen10+ V2 with the extra memory slots and less power usage per server (and probably better stability, with the firmware/etc having been around for longer). It would be a tough choice. The key point for me is my workloads don't even tax the existing DL380 Gen9 servers CPU-wise (but I want lots of cores for lots of VMs/host).

Amazon slaps automatic encryption on S3 data

Nate Amsden

Re: Really?

The way I believe most object storage works on the backend is that blocks are replicated between nodes. So even if someone were to get their hands on unencrypted drives that were used for S3 for some nodes, they'd only get partial bits of data; maybe a determined attacker could get something useful out of those partial bits but it would be a PITA.

Same reason I have never had a concern about not running encrypted at rest on 3PAR, if someone got a "pre failed" disk from one of my arrays, they'd just have random 1GB chunks of filesystems. Maybe you get lucky and find something useful but for anyone willing to go to those lengths they'd have to be very determined and there are probably much easier ways to compromise security. Of course there are industries/audit processes that require encryption at rest just for the checkbox.

Rackspace blames ransomware woes on zero-day attack

Nate Amsden

Re: Not us then

sure thing, I didn't know it myself until about a month ago (not my fault, as I have never been responsible for Office 365 nor Exchange in my career). I knew there were Office 365 backup solutions out there, and was looking into them a bit more out of curiosity and saw them quote that Microsoft site.

It's pretty bad that most Office 365 admins don't seem to understand it, and are just assuming MS is invincible and they don't have to worry about backups. At least in my experience of seeing people write "you should just move to Office 365", almost never have I seen them also say "oh but you need to keep your own backups too".

I am not sure if Rackspace had any formal way for customers to take proper backups (aside from outlook archives).

Nate Amsden

Re: Not us then

Microsoft would say the same if you use Office 365, backups are the responsibility of the customer.

https://learn.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility

Welcome to cloud.

(myself I have self hosted email for 24 years, and haven't been responsible for corporate email since 2002; at that point I ran email with Postfix/Cyrus IMAP, which is what I still use at home)

Elon Musk's cost-cutting campaign at Twitter extended to not paying rent, claims landlord

Nate Amsden

Re: Inevutable

I left a position 2 companies ago (~12 years ago). I've always made it a point not to sign contracts myself even if I had authority (and in many cases the companies I worked for didn't care if I signed provided it was approved, but oftentimes I just preferred not to personally sign regardless).

Anyway, after I left the CTO tried to terminate one of the contracts, I think it was either for the DNS provider Dyn, or Webmetrics for website monitoring. They were under contract but claimed the contracts weren't valid because I wasn't authorized to sign. Funny thing is, the vendor pulled up the contract and found/showed my former employer (who is long out of business now) that in fact it was my Senior Director who had signed the contract in question (that same director essentially resigned a day after I left, later tried to recruit me to join him at Oracle cloud but I declined, and he later retired), not me, and so their argument was invalid. I got a good laugh out of that story.

Stolen info on 400m+ Twitter accounts seemingly up for sale

Nate Amsden

Re: 400m users

Why would you think that? Twitter probably has far more than 400m accounts (I'd be surprised if they had less than 1.2 billion including bots/fake accounts/etc), the article does not indicate any of the accounts were active. Likely there are a bunch in the list that were, but maybe it's only 10-20% of "active" accounts. Or maybe a higher number, or a lower number..

Linux kernel 6.2 promises multiple filesystem improvements

Nate Amsden

This is not accurate. I've seen people write this "1GB per TB" claim tons of times.

The 1GB per TB was always about ZFS with dedupe enabled. Without dedupe you can get by with much less.
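As a rough illustration of why dedupe is the memory hog, a sketch using the commonly cited ~320 bytes per deduplicated block for the dedupe table; the constants are rules of thumb and the real number varies a lot with recordsize and data, so treat this as an estimate only:

```python
# Rough ZFS dedupe-table (DDT) RAM estimate. ~320 bytes per DDT entry and a
# 128KiB average block size are commonly cited rules of thumb, not exact figures.
DDT_BYTES_PER_BLOCK = 320
AVG_BLOCK_BYTES = 128 * 1024       # assumes mostly default 128K records

pool_tib = 10
pool_bytes = pool_tib * 1024**4

blocks = pool_bytes / AVG_BLOCK_BYTES
ddt_ram_gib = blocks * DDT_BYTES_PER_BLOCK / 1024**3

print(f"{pool_tib} TiB pool: ~{blocks/1e6:.0f}M blocks, "
      f"~{ddt_ram_gib:.1f} GiB of RAM just for the dedupe table")
# Without dedupe there is no DDT at all, just the ARC, which you can cap.
```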

Myself, on my laptop I still use ext4 despite having 128GB of RAM, just because it's simpler.

I do use ZFS in some cases, mainly at work, mainly for less used MySQL servers with ZFS compression enabled(and I use ZFS as a filesystem only, RAID is handled by the underlying SAN which is old enough not to support compression).

My home server runs an LSI 8-port SATA RAID card with a battery backup unit, 4x8TB drives in hardware RAID 10 with ext4 as the filesystem and weekly scrubs(via LSI tools). I used ZFS for a few years mainly for snapshots on an earlier home server(with 3Ware RAID 10 and weekly scrubs) but ended up never needing the snapshots, so I stopped using ZFS.

I do have a Terramaster NAS at a co-location for personal off site storage, which runs Devuan, and ZFS RAID 10 on 4x12TB disks. The boot disk is an external USB HP 900G SSD with ext4 again. That's the only place I'm using ZFS' RAID.

Haven't used anything but RAID 1/10 at home since about 2002, the last exception being a 5x9GB SCSI RAID with a Mylex DAC960 RAID card. The largest number of disks in an array at home since then has been 4.

At work I'm still perfectly happy with the 3PAR distributed sub-disk RAID 5 (3+1) on any of the drives I have, spinning or SSD.

openSUSE Tumbleweed team changes its mind about x86-64-v2

Nate Amsden

Re: Sensible

I remember back in the 90s efforts to optimize by compiling for i586 or i686 for example; then there was the egcs compiler (which I think eventually became gcc?), and then Gentoo came around at some point, maybe much later, targeting folks that really wanted to optimize their stuff. FreeBSD did this as well to some extent with their "ports" system (other BSDs did too but FreeBSD was the most popular at the time, probably still is). I personally spent a lot of time building custom kernels, most often static kernels; I didn't like to use kernel modules for whatever reason. But I tossed in patches here and there sometimes, and only built the stuff I wanted. I stopped doing that right around when the kernel got rid of the "stable" vs "unstable" trees as the 2.4 branch was maturing.

Myself, I never really noticed any difference. I've said before to folks that if there's not at least say a 30-40% difference then likely I won't even notice (not referring specifically to these optimizations, but to upgrading hardware or whatever). A 20% increase in performance, for example, I won't see. I may see it if I am measuring something, such as encoding video. But my computer usage is fairly light on multimedia things (other than HandBrake for encoding; I have ripped/encoded my ~4000 DVD/BD collection, but encoding is done in the background, so 20% faster doesn't mean shit to me, double the speed and I'll be interested provided quality is maintained). All of my encoding is software, I don't use GPU encoding.

I haven't gamed seriously on my computer in over a decade, I don't do video editing, or photo editing, etc etc. I disable 3D effects on my laptop (Mate+Mint 20), even though I have a decent Quadro T2000 with 4G of ram(currently says 17% of video memory is used for my 1080p display). I disable them for stability purposes(not that I had specific stability problems with them on that I recall, I also disable 3D acceleration in VMware workstation for same reason). I've never had a GPU that required active cooling and I have been using Nvidia almost exclusively for 20 years now (I exclude laptops since pretty much any laptop with Nvidia has fans, but the desktop GPUs I have bought, none have ever had a fan).

I really don't even see much difference between my 2016 Lenovo P50 with i7 quad core, SATA boot SSD (+2 NVMe SSDs), 48GB of RAM and Nvidia Quadro M2000M, and my new (about 2 months old now) Lenovo P15 with Xeon 8 core, 2 NVMe SSDs, 128GB of ECC RAM and Quadro T2000. It's a bit faster in day to day tasks, but I was perfectly happy on the P50.

My new employer insisted they supply me with new hardware so I said fine, if you want to pay for it, this is what I want. They didn't get it perfect, I replaced the memory with new memory out of pocket and bought the 2nd NVME SSD(not that I needed it, just thought fuckit I want to max it out). I was open this time around to ditching Nvidia and going Intel video only, but turns out the P15 laptop I wanted only came with Nvidia (even though it's hybrid, I think..). Since the Nvidia chip is there anyway I might as well use it, I've never had much of an issue with their stuff unlike some others that like to run more bleeding edge software. I expect a 6-10 year lifespan out of this laptop so I think it's worth it.

On the 12th day of the Rackspace email disaster, it did not give to me …

Nate Amsden

Don't forget

For those saying Office 365 is the best thing to use, MS is up front about their stuff too:

https://learn.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility

The "information and data" section is shown as the sole responsibility of the customer, not of MS.

"Regardless of the type of deployment, the following responsibilities are always retained by you:

Data

Endpoints

Account

Access management"

There are backup solutions for Office 365 for a reason.

I'm assuming here probably greater than 90% of their customers don't realize this.

(I don't vouch for any provider in particular; I'm a Linux/infrastructure person who has never touched Exchange in my life, and I've been hosting my own personal email on my own personal servers (co-lo these days) since 1997)

Nate Amsden

Re: Right.

I have read people claim ransomware can sit waiting for upwards of 90+ days before striking. I used that justification to finally get a decent tape drive approved a few years ago for my last company's IT dept; they then used Veeam to back up to tape.

I suppose in theory if you restored old data onto a server WITH A CLOCK SET TO THE RIGHT TIME (not current time), then perhaps it could be fine, but of course systems don't often behave well when their clocks are out of sync.

So for example if you restored data from 45 days ago onto a server with its clock set to 45 days ago then you may be OK; if you restore it to a server with current time then perhaps the existing ransomware will see the strike time has passed and activate again. I've never been involved in a ransomware incident myself so I don't know how fast it acts.

I'd assume this was a highly targeted attack against Rackspace, not a drive-by thing.

I also read years ago, on multiple occasions, claims from security professionals that on average intruders had access to a network for roughly 6 months before being detected. I first saw this claim reported by the then CTO of Trend Micro; I saw a presentation from him at a conference, and while normally I hate those kinds of things, that guy seemed quite amazing. I was shocked to see him admit on stage that "if an intruder wants to get in, they will get in, you can't stop that", and not try to claim his company's products can protect you absolutely. I posted the presentation in PDF form (probably from 2014) here previously, though it loses a lot of its value without the dialog that went along with it:

http://elreg.nateamsden.com/TH-03-1000-Genes-Keynote_Detect_and_Respond.pdf

Nate Amsden

Re: So where are the backups?

At my first system admin job back in 2000, one of the managers there (not someone I reported to) would on occasion ask me to restore some random thing. I thought it was a legitimate request so I did (or tried; sometimes I could not, depending on the situation). Later he told me he didn't actually need that stuff restored, he was just testing me. Which I thought was interesting. I wasn't mad or anything. I've never had a manager do that again, or at least never admit to it.

One company I was at, we finally got a decent tape drive and backup system in place. I went around asking everyone what they needed backed up, as we didn't have the ability to back up EVERYTHING (most of the data was transient anyway). Fast forward maybe 6-9 months and we had a near disaster on our only SAN. I was able to restore everything that was backed up; some requests did come in to restore stuff that was never backed up and I happily told them, sorry, I can't get that because you never requested it be part of what was backed up. In the end there was minimal data loss from the storage issue, but there were several days of downtime to recover.

My first near disaster with storage failure (I wasn't on the backend team that was responsible) was in 2004 I believe: a double controller failure in the SAN took out the Oracle DB. They did have backups, but they knowingly invalidated the backups every night by opening them read/write for reporting purposes. Obviously the team knew this and made the business accept that fact. So when disaster struck it was much harder to restore data, as you couldn't simply copy data files over or restore the whole DB, because the reporting process was rather destructive. Again, multiple days of downtime to get things going again, and I recall still encountering random bits of corruption in Oracle a year later (it would result in an ORA-600 or similar error and the DBA would then go in and zero out the bad data or something).

My most recent near storage disaster was a few years ago at my previous company. Their accounting system hadn't been backed up in years apparently; IT didn't raise this as a critical issue, and if they had raised it with me I could have helped them resolve it, it wasn't a difficult problem to fix, just one they didn't know how to do themselves. Anyway, the storage array failed, again a double controller failure. An end of life storage array in this case. They were super lucky that I was friends with the global head of HPE storage at the time, and after ~9 hours of our 3rd party support vendor trying to help us I reached out to him in a panic and he got HPE working around the clock; it took about 3 days to find and repair the metadata corruption, with minimal data loss (no data loss for the accounting folks). I was quite surprised when I asked for a mere $30k to upgrade another storage system so we could move the data and retire the end of life one, and the same accounting people who almost lost 10 years of data with no backups told me no.

IBM to create 24-core Power chip so customers can exploit Oracle database license

Nate Amsden

Re: For now...

Kind of surprised they haven't already. I moved a company from Oracle EE to Oracle SE back in about 2008 for exactly this reason (I did so as a result of the company failing their 2nd Oracle audit, after ignoring my advice to make this change after they failed the first Oracle audit, in which they were caught running Oracle EE when they only had a license for Oracle SE One, not even SE, SE ONE). I remember when buying our servers for Oracle EE I opted for the fastest dual core processors; by the time we switched to SE, quad core had come out, so I changed them to single socket quad core. We even encountered a compatibility issue on our early DL380 G5s when upgrading to quad core: they didn't work without a motherboard replacement. HP later realized this and updated their docs. I don't think they charged us for the replacement since they told us it would work when we ordered the new parts, and it was their staff doing the hardware change. I remember AMD talking shit about Intel's quad core chips not being true quad core but a pair of dual core chips (an early version of "chiplets" maybe?).

I also remember having to "school" Oracle's own auditors regarding Oracle SE licensing, specifically the unlimited cores per socket, which they didn't believe until they looked it up themselves; their hearts sank when they realized they could not do per-core licensing on that.
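To illustrate why their hearts sank, a sketch of per-core vs per-socket licensing math with placeholder prices (the real list prices and core factors change over time, so every constant here is an assumption, not an Oracle quote):

```python
# Hypothetical per-core (EE-style) vs per-socket (SE-style) licensing math.
# All prices and the 0.5 core factor are placeholders, not real Oracle figures.
EE_PER_PROCESSOR = 47_500     # assumed EE list price per processor license
CORE_FACTOR = 0.5             # assumed x86 core factor for EE
SE_PER_SOCKET = 17_500        # assumed SE list price per socket

sockets, cores_per_socket = 1, 4    # the single-socket quad-core box above

ee_cost = sockets * cores_per_socket * CORE_FACTOR * EE_PER_PROCESSOR
se_cost = sockets * SE_PER_SOCKET

print(f"EE-style (per core): ${ee_cost:,.0f}  vs  SE-style (per socket): ${se_cost:,.0f}")
```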

Back then you could run Oracle Enterprise Manager with the performance tools(which were great IMO) on Oracle SE, even though technically you could not from a licensing standpoint(can only use it on Oracle EE). If they ever audited again I could easily uninstall OEM with a simple command(which I had to do regularly for various reasons on test systems). Newer versions of Oracle made this trick impossible(at least for me). Oracle did not audit again before the company shut down. Note: I am not a DBA but sometimes I fake it in my day job.

I also did the same for some of our VMware hosts. VMware at the time required purchasing licensing in pairs of sockets, and they said they did not support single CPU systems, though I assumed at the time that really meant single socket, single core.

Then I combined the two, at least in a couple of cases: Oracle and VMware on top of a single socket DL380 G5 with a quad core CPU (I don't remember how much memory or disk); some of the systems were connected to my first 3PAR, a tiny E200. Probably totally unsupported by anyone officially; Oracle's policy at the time was that you had to reproduce the issue on a supported system. But I don't recall ever having to open a support case with either company while I was there, at least not for the ones running on VMware. We did have some support cases for production, which ran on bare metal.

At the time Oracle SE was max 2 sockets per system(no core restrictions) and max 4 sockets in a RAC. Haven't looked at their model since but it sounds like it's probably the same today.

Oracle clouds never go down, says Oracle's Larry Ellison

Nate Amsden

Re: IaaS loves to blame the customer

One more story, I like talking/writing about this kind of stuff. This was from a former co-worker, who said he used to work as a tech at some data center. It was a small one, no big names. But his story (from some time before 2010) was that they had a generator on site, in a building or some structure to protect it from the elements. They ran load tests on it semi-regularly, but the load tests were only for a few minutes.

One day a power outage hit, the generator kicked on as expected, then shut down after something like 15-30 minutes as it overheated (I think he said the overheating was related to the enclosure the generator was in). So in that situation there was bad design and bad policy, and getting either one right should have caught the issue long before it impacted customers.

Another case of bad design IMO is any facility using flywheel UPS. My thought there is that I want technical on site staff 24/7 at any facility, staff able to respond to stuff. A flywheel UPS only gives enough runtime for a few seconds, maybe a minute, for the generators to kick on. That is not enough time for a human to respond to a fault (such as the switch that starts the generators failing; this happened at a San Francisco facility that used flywheels back in 2005ish?). I was touring a new (at the time) data center south of Seattle in 2011, a very nice facility; Internap was leasing space there and I was talking with them about using it. I mentioned my thoughts on flywheels and the person giving the tour felt the same, and said I think it was Microsoft that had a facility nearby at the time that used flywheels, and he claimed they had a bunch of problems with them.

Not that UPSs are flawless by any means; I just would like to see at least 10 minutes of power available between failure and generator kick-on. However that power is provided is less important, as long as it works. Flywheels (the ones I'm aware of anyway) don't last long enough. Certainly there will be situations where a failure cannot be fixed in 10 minutes, but I'm confident there are at least some scenarios where it can be (the automatic transfer switch not "switching" automatically and needing someone to manually switch it being the biggest).

Nate Amsden

Re: IaaS loves to blame the customer

I dug up the power issue from Amsterdam, at what was previously a Telecity data center that Equinix had acquired by this point (2018):

"DESCRIPTION: Please be advised that Equinix and our approved contractor will be performing remedial works to migrate several sub-busbar sections back from there temporary source to the replaced main busbar which became defective as reported in incident AM5 - [5-123673165908].

During the migration, one (1) of your cabinet(s) power supplies will be temporary unavailable for approximately six (6) hours. The redundant power supply(s) remains available and UPS backed. "

But this power incident wasn't critical for me since everything was redundant on my end. I'm not a power expert so I certainly can't say for sure whether a better power design could have allowed this kind of maintenance to be done without taking power circuits down to customers. But I can say I've never had another facility provider need to take power offline for maintenance for any reason in almost 20 years. Perhaps for this particular activity it would have been impossible to avoid, I don't know.

After Equinix acquired Telecity I noticed the number of customer notifications went way up; Telecity had a history, with me at least, of not informing customers of stuff. I hated that facility and its staff AND policies so much. I only visited it twice, before Equinix took over, and according to my emails it looks like we moved out less than 3 months after the above power issue (the move was unrelated to that).

Nate Amsden

Re: IaaS loves to blame the customer

I don't agree there at all. Good infrastructure management is good management. Having a properly designed facility is a good start. Well trained, knowledgeable staff is also important. Having and following standards is also important.

That Fisher Plaza facility in Seattle at the time, as far as I recall, had issues such as:

* Staff not replacing UPS batteries before they expired

* Not properly protecting the "Emergency Power Off" switch (one power incident was a customer pressing it to find out what would happen; after that all customers were required to have "EPO training")

* Poor design led to a fire in the power room years after I moved out, which caused ~40 hours of downtime and months of running on generator trucks parked outside. A couple of years later I saw a news report of a similar fire at a Terremark facility; in that case they had independent power rooms and there was zero impact to customers.

* Don't recall the causes of other power outages there if there were any other unique causes.

Another facility I was hosted at, in Amsterdam, had an insufficient power design as well, and poor network policies:

* The network team felt it was perfectly OK to do maintenance on the network, including at one point taking half of their network offline WITHOUT TELLING CUSTOMERS. They fixed that policy after I bitched enough. My normal carrier of choice is Internap, which has a 100% Uptime SLA, and has been excellent over the past 13 years as a network customer. Internap was not an option in Amsterdam at the time so we went with the facility's internet connection which was wired into the local internet exchange.

* At one point they told customers they had to literally shut off the "A" power feeds to do something, then the following week they had to shut off the "B" power feeds to do the same thing to the other side. I don't recall what it was, but obviously they didn't have the ability to do maintenance without taking power down (so I am guessing no N+1). There was no real impact from either event on my end, though we did have a few devices that had only 1 PSU (with no option on those models for a 2nd), so we lost those; however they had redundant peers so things just failed over. In nearly 20 years of co-location only that facility ever had to take power down for maintenance.

One company I was at moved into a building (this was 18 years ago) that was previously occupied by Microsoft. We were all super impressed to see the "UPS room"; it wasn't a traditional UPS design from what I recall, just tons of batteries wired up in a safe way I imagine. They had a couple dozen racks on site. It wasn't until later that the company realized most/all of the batteries were dead, so when they had a power outage it all failed. None of that stuff was my responsibility, all of my gear was at the co-location.

My first data center was in 2003, an AT&T facility. I do remember one power outage there, my first one, I recall I was walking out of the facility and was in the lobby at the time when the lights went out. I remember the on site staff rushing from their offices to the data center floor and they stopped to assure me the data center floor was not affected(and it wasn't). Power came back on a few minutes later, don't recall if it was a local issue to the building or if it was a wider outage.

My first server room was in 2000. I built it out with tons of UPS capacity and tons of cooling. I was quite proud of the setup, about a dozen racks. Everything worked great, until one Sunday morning I got a bunch of alerts from my UPSs saying power was out. Everything still worked fine, but about 30 seconds later I realized that while I had ~45min of UPS capacity I had no cooling, so I rushed to the office to do graceful shutdowns of things. Fortunately things never got too hot; I was able to be on site about 10 mins after the power went out. There was nothing really mission critical there, it was a software development company and the majority of the gear was dev systems; the local email server (we had 1 email server per office) and a few other things were there as well.

There are certainly other ways to have outages. I have been on the front lines of 3 primary storage array failures in the last 19 years, arrays which had no immediate backup, so all of the systems connected to them were down for hours to days for recovery. And I have been in countless application related outages as well, the worst of which dates back 18 years, to an unstable app stack being down for 24+ hours and the developers not knowing what to do to fix it. At one point there we had Oracle fly on site to debug database performance issues too. I've caused my own share of outages over the years, though I probably have a 500:1 ratio of outages I've fixed or helped fix vs outages I caused.

My original post, in case it wasn't clear, was specific to facility availability and to a lesser extent network uplink availability.

Nate Amsden

IaaS loves to blame the customer

That's something that surprised me a lot back when I first started using cloud 12 years ago (I haven't used IaaS in a decade now). Some of their SLAs (perhaps most) are worded in ways that say: oh well, if this data center is down it's not really down for you unless you can't fire up resources in another data center. If you don't have your data in multiple data centers, well, that's your fault and we don't owe you anything.

Which to some degree makes sense, customers using cloud often don't know how to do it "right" (because doing it right will just make it more expensive in many cases, and certainly more complex). Most traditional providers (whether datacenter, network or infrastructure) will of course advise you similarly, but they will often take much greater responsibility when something bad happens, even if the customer didn't have better redundancy.

Myself I haven't been hosted in a data center(for work) that had a full facility failure since 2007. That's 15 years of co-location with zero facility outages. So forgive me if I'm not going to get super stressed over not having a DR site. That data center in 2007 (Fisher Plaza in Seattle, and I moved the company out within a year of starting my new position there) remains the only facility I've been with that had serious issues going back to 2003.

Of course not all facilities are the same. The facility I use for my personal co-location HAS had several power outages in the past decade(went a good 5-6 years before the first one when I became a customer). But they are cheap, and otherwise provide decent service. I can live with those minor issues(probably still better uptime than Office365 over the years even with my SINGLE server, not that I'm tracking). I need only to walk into that facility to immediately rule it out for anything resembling mission critical or anything resembling not fully active-active (across multiple facilities) operations. They don't even have redundant power(facility dates to the 90s).

I've said before I would probably guesstimate that I'd rule out 60-75% of data centers in the world for mission critical stuff(Bing tells me there are ~2500 global data centers). All of the big cloud providers design their systems so their facilities can fail, it's part of their cost model, so naturally I am repelled by that.

VMware loses three top execs who owned growth products

Nate Amsden

Re: Troubled phrasings ?

I'd expect most customers are on maintenance contracts so the new versions would be provided to them free as part of maintenance. At least that's how it works with ESXi and vCenter.

Cloud customers are wasting money by overprovisioning resources

Nate Amsden

Re: I have wondered about de-dupe

I don't believe most IaaS clouds do dedupe for storage, at least not the big ones. The enterprise clouds I'm sure do. I'd expect customers not to see any line items on their bills related to dedupe; the providers would just factor in what their typical dedupe ratios are and figure that into the cost to the customers.

But forget dedupe, I'd expect most cloud providers to not even do basic thin provisioning and reclamation (except enterprise clouds again, for the same reasons). Thin provisioning AFAIK was mainly pioneered by 3PAR back around the 2003ish time frame; I started using them in 2006, and thin reclaim didn't appear until about late 2010 I think (and it took longer to get that working right). Then discard at the OS/hypervisor level took time to implement as well (3PAR's original reclaim was "zero detection", so I spent a lot of time with /dev/zero writing zeros to reclaim space, also sdelete on Windows, prior to discard being available).

For my org's gear we didn't get end-to-end discard on all of our Linux VMs (through to the backend storage) until moving to Ubuntu 20 (along with other hypervisor VM changes) in late 2020. I had discard working fine on some VMs that used raw device maps for a while prior. I know the technology was ready far before late 2020, but to do the changes to the VMs it was better to wait for a major OS refresh (16.04->20.04 in our case) rather than shoehorn the changes inline. Wasn't urgent in any case.

I remember NetApp pushing dedupe hard for VMware stuff back in the 2008-2010 time frame; I never really bought into the concept for my workloads. I'm sure it makes a lot of sense for things like VDI though. When I did eventually get dedupe on 3PAR in 2014 (16k fixed block dedupe, I don't know what NetApp's dedupe block size was/is) I confirmed my original suspicions: the dedupe ratio wasn't that great, since there wasn't that much truly duplicate data (which would have been OS data, and a typical OS was just a few gigs in Linux). I expected better dedupe on VMware boot volumes (boot from SAN); initially the ratio was great (don't recall what exactly), but my current set of boot LUNs were created in 2019, and now the dedupe ratio is 1.1:1, which is basically no savings, so next time around I won't enable dedupe on them. (ESXi 6.5 here still, and I read that ESXi 7 is much worse for boot disk requirements.) The average VMware boot volume is 4.3GB of written data on a 10G volume.

Nate Amsden

cloud architecture is the problem

I realized this about 12 years ago myself. Best to move away from the model of fixed provisioning of resources (fixed meaning provisioning a VM with a set cpu/mem/disk), and towards pooling of resources and provisioning from that pool (how ESXi works, and I assume how other hypervisors like Xen/Hyper-V work on prem). Same with disk space/IO. Nearly 70% of the VMs in my internal environment this year were 1 CPU. Memory ranges from 2GB to 32GB for most things.

Disk space for most systems is less than 10GB each; some have 300-500GB (a couple have more), some have 1TB. But every Linux VM gets (by default) 1.8TB of thin provisioned storage, controlled via LVM (so I don't have to touch the hypervisor again if I need more space), and I have discard/trim enabled end to end. It works, except for things that use ZFS: even though ZFS claims to support trim (and autotrim is enabled at the pool level), my experience shows it is completely ineffective with non-test workloads, at least when compression is enabled. All storage is pooled from the same back end, and of course I keep close tabs on what is using disk I/O. Though disk I/O hasn't been an issue since switching to all flash in 2014. There was a time with spinning disks that a single "bad" MySQL query would consume more disk I/O than 500+ other VMs on the same storage array combined. Fortunately Percona wrote pt-kill, so I used that to keep those queries under control.
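A minimal sketch of the overcommit math behind that default; the VM count and backend capacity are made-up assumptions, while the 1.8TB default and ~10GB typical usage are from above:

```python
# Thin-provisioning overcommit sketch. The 1.8TB-per-VM default and ~10GB of
# typical usage come from the description above; VM count and backend capacity
# are made-up assumptions for illustration.
PROVISIONED_TB_PER_VM = 1.8
TYPICAL_USED_GB = 10

vm_count = 500               # assumption
backend_usable_tb = 100      # assumption

provisioned_tb = vm_count * PROVISIONED_TB_PER_VM
written_tb = vm_count * TYPICAL_USED_GB / 1000

print(f"provisioned: {provisioned_tb:.0f} TB, actually written: ~{written_tb:.0f} TB, "
      f"overcommit vs backend: {provisioned_tb / backend_usable_tb:.0f}x")
# The point: provision generously, let discard/trim hand space back, and watch
# real usage on the array rather than per-VM allocations.
```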

This pooling approach to VMs is easily 15 years old at this point; it just blows my mind that people don't seem to understand this in 2022 (some do for sure, but most do not).

Equinix to cut costs by cranking up the heat in its datacenters

Nate Amsden

Re: We make a rod for our own backs...

Google did something more creative than that: at one point, it looks like back in 2009, they released info showing that they were building servers with batteries built in (instead of large centralized UPSs), with the justification being that most power events only last a few seconds, so they could cut cost/complexity with that design.

Don't know how long that lasted or maybe they are still doing that today. Never recall it being mentioned since.

Nate Amsden

Re: We make a rod for our own backs...

oh yeah, that's right, sorry was a long time ago!

Nate Amsden

Re: This is not how data centers work

Be sure to deploy your own environmental sensors. Most good PDUs have connections for them. I have at least 4 sensors (2 front/2 back) on each rack (2 PDUs x 2 sensors each). They monitor temperature and humidity.
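As an aside, the alerting side of this can be as simple as a threshold check; here's a minimal sketch with made-up thresholds and readings, with the real values coming from whatever your PDUs expose (SNMP, a vendor API, etc.):

```python
# Minimal threshold-check sketch for rack environmental sensors.
# Thresholds and readings are made up; feed in real values from your PDUs.
TEMP_RANGE_C = (18.0, 27.0)          # assumed cold-aisle intake alert range
HUMIDITY_RANGE_PCT = (20.0, 80.0)    # assumed humidity alert range

readings = {
    "rack1-front-top":    {"temp_c": 22.5, "humidity_pct": 45.0},
    "rack1-front-bottom": {"temp_c": 29.0, "humidity_pct": 15.0},  # should alert
}

def out_of_range(value, lo, hi):
    return value < lo or value > hi

for sensor, r in readings.items():
    if out_of_range(r["temp_c"], *TEMP_RANGE_C):
        print(f"ALERT {sensor}: temperature {r['temp_c']}C outside {TEMP_RANGE_C}")
    if out_of_range(r["humidity_pct"], *HUMIDITY_RANGE_PCT):
        print(f"ALERT {sensor}: humidity {r['humidity_pct']}% outside {HUMIDITY_RANGE_PCT}")
```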

I remember the first time my alarms for them tripped, I opened a ticket with the DC asking if anything had changed. It wasn't a big problem (the humidity had either dropped below or exceeded a threshold, I forget which); I was just curious about the dramatic change in readings. They responded that they had just activated their "outside air" cooling system or something, which was the cause of the change in humidity.

I had major thermal issues with a Telecity facility in Amsterdam; no equipment failures, just running way too hot in the cold aisle. I didn't have alerting set up for a long time, then when I happened to notice the readings it started a several-months-long process to try to resolve the situation. It never got resolved to my satisfaction before we moved out, though.

I remember another facility in the UK, at another company, that was suffering equipment failures (well, at least one device; their IDS failed more than once). The facility insisted the temperature was good, then we showed them the readings from our sensors and they changed their tune. They manually measured, confirmed the temps were bad and fixed them. I was never on site at that facility so I'm not sure what they did, perhaps just opened more floor panels or something.

But just two facilities with temperature issues over the past 19 years of using co-location.

Power and cooling: two things most people take for granted when it comes to data centers (myself included). Once you've had a bad experience or two with either, you stop taking them for granted.

Nate Amsden

Re: We make a rod for our own backs...

Lots of gear can come with DC power supplies. For a while companies like Rackable, who were popular in the hyperscale space back before 2010, had rack-based AC-DC systems. (Rackable later bought SGI's assets and took the SGI name, and HPE eventually bought SGI; those product lines are long dead.) I was interested back in 2009 in one of their products called "CloudRack", which was neat in theory, though I never got to see it in action. Built for hyperscale, the servers had no power supplies or fans; there was a rack-level supply of both that fed the servers (sample server picture from my blog at the time: http://www.techopsguys.com/wp-content/uploads/2011/05/c2-tray.jpg). I wanted to get it for a Hadoop build out at the company I was at. I wouldn't dare use them for anything mission critical of course.

I think going beyond rack-based DC distribution is likely to be wasteful/inefficient due to the loss of energy over distance? I thought I was told/read something along those lines at one point. Also I think I was told/read that DC is much more dangerous than AC.

Another efficiency gain is increasing the voltage. I've never seen it used myself but the PDU vendor I use (ServerTech) at one point was pushing 415V (https://www.servertech.com/solutions/415v-pdu-solutions). Unsure how much savings that higher voltage can bring.

Nate Amsden

Re: Turn down the lights!

At least where the equipment is, I'd assume almost all data centers use motion-activated lighting. I remember the first data center I visited in 2003 was an AT&T facility; I really liked that place. I don't recall if they used motion sensors or simple timers on their floor lights (it was one giant warehouse floor with a huge raised roof). I heard stories about that place when it first opened: there was so little equipment that gear was running too cold, and they actually put space heaters in for some customers to get the temperature to more normal levels. By the time I saw it they had plenty of customers and no heaters to be found.

The practice is so annoying that I install my own LED lighting (utility clamp lamps from a hardware store). I don't go on site often (I'm currently in probably my longest stretch, not having been on site in just over 3 years), but there have been times when I was on site for a dozen hours straight. Not only did my LED lights provide much better lighting, but I didn't have to walk around the cage every hour (or more) waving my arms to trigger the motion sensors.

Nothing was more annoying than a Telecity data center I dealt with in Amsterdam for a few years though, the only place I've been at with hot/cold aisle isolation. Sounded neat in theory, but it was so annoying to have to walk ~120 feet between the front and the back of the rack. Almost as bad, they required you to put these little elastic "booties" on your shoes before walking on the raised floor, another practice I've never seen anywhere else (and completely crazy). I hated that place. Ironically Equinix acquired them eventually (after my org cut ties with them). Then it took another 3 years to get off their mailing lists.

But of course the power used by the lights is probably a rounding error given how efficient LEDs are.

Longstanding bug in Linux kernel floppy handling fixed

Nate Amsden

1.9MB floppies

I remember some app back then allowed you to format a regular 1.44MB floppy (which I think was otherwise 2MB unformatted) into a usable ~1.9MB disk, and I think it was done in a standard way, with no special software needed to read it.

I remember being excited to buy the retail version of Win95, after having used pirated betas for a while. I bought it from Fry's Electronics (RIP) and wasn't paying attention; it wasn't until I got home that I realized I had bought the floppy version. Ugh. Then I made that same mistake again with OS/2 Warp, though I remember OS/2 being far worse (more disks? slower access?), or maybe that memory is incorrect. Also the OS/2 Warp addon that gave internet support? I didn't use it long, just played with it, and multi-booted with System Commander(?).

The last time I think I was in a situation that required floppy disks was in 2009: I needed to update the firmware on many Seagate drives in several older Dell rackmount systems, and the only way to do it was a DOS floppy. Fortunately I wasn't the one who had to go connect a USB floppy drive to each system and boot/update them, but I did figure out what the issue was (after the company had struggled to find a solution for a year before I was hired). It was the first time in my career that I needed to update the firmware of a disk drive.

Rackspace rocked by ‘security incident’ that has taken out hosted Exchange services

Nate Amsden

Re: Business Continuity

That very much could be true, I have read some similar stories. I certainly can't vouch for their quality of service, never having been a customer - only for what I know of what their model was years ago. But support in general from many vendors has fallen off a cliff in that same time frame, which is also sad. I've experienced this myself over the years too. Can't remember the last time I read someone speak positively about VMware support, for example (even those in big accounts that spend tons of $$). On one of my last HPE 3PAR support tickets I literally had to help their support engineer type in the right commands to get the task done (via HPE MyRoom). These were basic Linux commands (the task was to delete some ISO images related to past software updates on the storage controllers to free space). They actually wanted to replace the controller hardware because the internal disk was getting full. I forced them to escalate and not replace hardware when a few "rm" commands to nuke those ISOs was enough. What should have been a 30-minute process dragged out over days, and the final call where we did the tasks probably took over an hour while they struggled to get the commands right, until I'd had enough and intervened.
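For what it's worth the fix itself was nothing exotic, just standard Linux housekeeping on the controller node. The paths below are hypothetical (I'm not quoting actual 3PAR paths), but it was essentially:

df -h                            # see which filesystem on the node is filling up
ls -lh /some/update/dir/*.iso    # hypothetical location of the leftover update ISOs
rm /some/update/dir/*.iso        # free the space, no hardware replacement needed
df -h                            # confirm the space came back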

Fortunately (and I'm sure there is some luck involved) my strategy is to build simple yet reliable systems that rarely end up needing to interface with support staff. I also run very conservative software versions and configurations, further limiting exposure to bugs. I'm almost always years behind the bleeding edge, letting others experience the bugs and get fixes in first. My VMware stacks averaged less than 1 support ticket per year (with front-line VMware support provided by HPE) over a decade. Fortunately none of those tickets were too concerning.

I read some comments last night regarding Rackspace's hiring practices; the staff seemed super protective of their domains, and were more concerned with "who you knew at Rackspace" than "what you know about the technology". Which is a bad situation for sure. All the more reason I like to operate my own stuff end to end (though I do use co-location). I haven't dealt with corporate email (as in managing it) since 2002 - I ran it on Linux at the time, and that was the last time I considered myself part of corporate IT; in the years since I have been in operations.

Nate Amsden

Re: Business Continuity

I'd be willing to bet most (90%+) of the customers did not take their own backups, just as most Office 365 customers most likely don't take their own backups. Quite surprising really (or maybe I shouldn't be surprised).

Mirrored sites can still be compromised - if anything it may be easier: compromise one site and the replication automatically compromises the other site(s) for you (depending on how it was compromised and what kind of replication). Failures can also replicate; data corruption can destroy multiple sites as fast as your replication can send it.

An ISP going down and a security compromise are very different things. I myself have been involved with 3 primary storage array (SAN) failures in my career; all of them took multi-day recovery efforts, all lost some data with a risk of total data loss, and in all cases the company did not have good backups. Also, in all cases the company chose not to immediately invest in better protection going forward following the near disaster. All 3 situations were the most scary of my career as well, and in the two I was directly involved with I pulled an unbelievable amount of monkeys out of my ass to get the systems working again. The first one was early in my career and I was on the ops/app team, not the backend team, so I just waited while they worked to fix the issue. But I was the one to report the issue to everyone; I'll always remember the Oracle DBA telling me he almost got into a car accident when he read my emergency alert, sent to everyone that Sunday around lunch time (with output from the HP-UX Oracle systems showing "I/O error" on several mount points in the df command output). I spent about 32 hours on a conference call for that, probably my longest ever conference call.

I've been fortunate never to have been involved in a serious security incident(have had to deal with a few stupid hacks from unmaintained systems that I was asked to help with over the years).

I run my stuff pretty well, though nothing is perfect; the best strategy (if possible) is to try not to be a tempting target. Rackspace, hosting a lot of customer stuff, is obviously not in a position to do that, so they have to deal with a lot more than I do.

Nate Amsden

Re: Business Continuity

Curious what you mean by this. Rackspace has been managing Exchange systems for well over a decade at this point, so they obviously have a lot of experience there. I assume they still operate their own data centers in many/most cases? (I know their business has changed quite a bit in the past decade.) In this particular situation I would say they are expected to provide the service levels associated with mission critical Exchange - that's what the customers are likely paying for anyway. I'm just not sure what you mean by "no third party hosted service can". Do you mean that only by doing it yourself can you provide true mission critical services? Or do you mean only Microsoft can provide mission critical Exchange (obviously far from the only mission critical app stack out there)? Or both, or something else?

Sounds like Rackspace's communication was poor on this, but taking down everything was a good response assuming they did it right away after they determined it was a security issue.

I'm very much pro on prem for everything, at least everything I know(mostly Linux based and I do infrastructure too). I've been operating mission critical internet facing infrastructure for 20 years(as of March 2023), and non mission critical internet facing infrastructure since 1997.

I don't know Exchange (the bulk of my Windows expertise dates back to the NT4 era) and I find it interesting that so many self-proclaimed Exchange experts/admins advocate using Office 365 over hosting it themselves; I guess MS did a really poor job with that software stack, or the average Exchange expert/admin is an idiot (or both). I remember seeing some cool "time machine"-style backup systems for Exchange ~12-13 years ago that backed things up in real time and let you roll back to any moment with the click of a button (or at least that was the marketing - I never saw it in action). I've forgotten their names. They were probably put out of business by Office 365 (fewer customers running Exchange), which from what I understand doesn't have anything remotely like that ability.

I've never been a Rackspace customer myself, every time I looked at the pricing(last time was 12 years ago) it didn't make sense given what I can accomplish with regular co-location. But there are probably lots of customers that really need their hands held on everything, so probably a good solution for them.

VMware refreshes desktop hypervisors, adds Apple Silicon support

Nate Amsden

Looks like the same for Workstation 16; looking at the link, they have been pretty consistent - every 2 years they retire support for those products (well, at least Workstation). I've been a Workstation user since 1999 (before Workstation the product was called just "vmware"). I never paid close attention to the support lifecycle. I only just upgraded to Workstation 16 myself a few weeks ago (as part of setting up a new laptop - I'd owned a copy of 16 for a while but didn't feel a need to update from 15). Technically my Workstation 16 license is from my new job; my personal Workstation 16 license remains unused at this time.

Side note: VMware typically has pretty good sales on Workstation (and I assume the other desktop products) on Black Friday/Cyber Monday. That's when I buy it anyway.

You fire 'em, we'll hire 'em: Atlassian sees tech layoffs as HR heaven

Nate Amsden

Re: Give me On-Prem

Last I heard Atlassian didn't even have a cloud of their own - they use someone else's, so they get dinged for very high costs there. At least Microsoft has its own cloud.

Atlassian's UI changes over the past several years have significantly soured my view of their main products, Jira and Confluence at least. I was a very happy customer for many years prior.

The GNOME Project is closing all its mailing lists

Nate Amsden

never heard of discourse

But as a Mailman 2 user and admin for 20 years, I spent a little time looking at Mailman 3 and noped out of it. I built a dedicated system just for Mailman 2 on an older distro. My personal lists only get a few messages a month so there's no real risk.

But wow, Mailman 3 looked way too complex.

Public cloud prices to surge in US and Europe next year

Nate Amsden

Re: On prem hosting

Energy is a factor with on-prem of course, but I think not a huge one. Take my org for example: when Covid hit they wanted to cut costs. My infrastructure runs super reliably, with hardly any failures, so I was comfortable cutting pretty much all VMware and premium HPE hardware support for our servers. That took the annual cost down from anywhere between $3,000-5,000/year/server (depending on server type) to about $200/year/server. The cost of server support alone (let alone support costs for other things) exceeded the cost of power/data center space for the environment. It's not something I would do in every situation of course, but the option is there. I averaged less than 1 VMware support case per year prior. My config is simple and conservative and has been solid as a rock for a decade across vSphere 4.1, then 5.5, then 6.5 (skipped versions not mentioned; the company started by moving out of cloud to vSphere 4.1 in 2012).

I also took some of our storage and put it on 3rd-party hardware-only support, saving a whole lot as well ($85k/year -> $12k). I felt comfortable doing this because of the operating track record, my experience, and the low amount of change in the environment. Network equipment remained on vendor support.

With full public cloud you are hit with all of those extra support fees (included in the cost of the product) whether you need them or not (with cloud you probably do, since they are always fucking with it). Of course most big IaaS clouds aren't paying VMware fees, but they have their own support costs, development costs, and operations costs for managing and updating their systems. Managed hosting has even more costs since they are even more hands-on, vs colo (which I have been working with since early 2003).

You also have the ability to run hardware for longer periods of time. I have data center switches in mission-critical roles that will literally be 11 years old next month (planning on replacing them soon; my 11-year-old 10G switches are still not EOL..); other switches are 6 to 9 years old. Load balancers are 8 and a half years old (not EOL yet). I have servers still in use that are 8 years old.

All of that is tons of cost savings - things you don't have flexibility with in cloud (or managed hosting) because you can't control them. Of course everything has to be retired at some point. Our mission critical all-flash storage array celebrates its 8th birthday in less than 2 weeks (0 hardware or software failures during that time). My hypervisor (ESXi 6.5) just went EOL this month so I will be looking to update compatible systems to 7 soon (I already have the license upgrades from a couple of years ago but postponed the project for various reasons). I went past EOL on vSphere 4.1 and 5.5 as well before upgrading - which for me means fresh installs; I'm not going to upgrade 6.5 to 7 in place, I'm going to start fresh with a clean config. I prefer things to be clean even if it's more work up front.

Add to that the ability to over-provision, something you cannot do in almost all IaaS clouds (certainly all of the biggest ones).

Add to that that all of the biggest clouds say "pay for what you use" but really it's "pay for what you provision". If you provision an 8-CPU VM with 32GB of RAM you are paying for that VM regardless of how utilized it may or may not be, vs on-prem where you can run many such VMs - if you know they won't all be busy at the same time you can over-provision safely.
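To put some purely illustrative numbers on it (my own hypothetical figures, not any provider's pricing): say an 8-vCPU/32GB VM rents for $300/month and averages 20% CPU utilization - you still pay the full $300, so you're effectively paying for ~1.6 vCPUs of actual work. On-prem, if those workloads don't peak at the same time, a single 2-socket host with 64 cores can comfortably carry a 3-4x vCPU over-provisioning ratio, so the same mostly-idle VMs pack far more densely onto hardware you already own.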

Basecamp decamps from cloud: 'Renting computers is (mostly) a bad deal'

Nate Amsden

good for them

I saw a link to another article yesterday quoting something like 80% of survey respondents saying their CIOs were demanding they halt or reduce cloud spending. Good to see. Cloud can make good sense in some cases, but it's really more on the SaaS side (and even then not in all cases). IaaS has been a disaster from the beginning, and PaaS not far behind for other reasons (as such, PaaS has had by far the least adoption, I believe, among the *aaS things).

I realized this myself about 12 and a half years ago, shortly after starting my first investigations into cloud. My first potential cloud project had an ROI for doing things "in house" vs in cloud of 8 months, and that was assuming the cloud solution worked flawlessly, which I was sure it would not (due to many factors) - but that's hard to quantify at that stage of planning.

Moving out saved my org (which was "born" in the cloud, so one can't say they "lifted and shifted"; the people doing the cloud work had 2+ years of prior cloud experience, so they weren't newbies) well over $10M since early 2012, while providing greater performance, security, and reliability at the same time. My manager at the time projected the ROI of bringing things in house at 7 months. I had to re-justify the data center many times over the decade as executives came and went; some tried to push cloud again, and fell on their faces every time as they faced the massive cost differences.

I offered to save my previous org several million back in 2011 by moving out (a tiny startup spending upwards of $500k/mo on cloud). Everyone in the company was on board except for the board, so I left shortly after that. That company is long since dead.

Linus Torvalds's faulty memory (RAM, not wetware) slows kernel development

Nate Amsden

Re: Take This With a Grain of Salt

I'm quite sure I had Advanced ECC on Proliant G3.

I quoted the 90s because this HP doc says as much:

http://service1.pcconnection.com/PDF/AdvMemoryProtection.pdf

"To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in

1996. "

Nate Amsden

Re: Take This With a Grain of Salt

I was running Linux servers 20 years ago with 3 to 4GB of RAM, and all of it was ECC.

Unsure why the logic of running Linux on 4GB back then would be different now. Regular ECC is barely adequate anymore. HP's Advanced ECC came out in the 90s and is far superior. IBM came up with Chipkill around the same time (I never used IBM servers myself). Dell never came up with anything of their own but offered an "Advanced ECC" option in their Xeon 5500 systems, I remember. It was a different technology vs HP's - Dell's Advanced ECC came from Intel, and while it did the job, it removed a third of the memory capacity of the system if I recall right. HP's has no overhead by default (but can have overhead if you enable even greater protection against DIMM failures, including online spare memory and memory mirroring).

That said, I just got my first laptop with ECC last week: a Lenovo P15 with an 8-core Xeon and 80GB of RAM. I was kicking myself for not going ECC on my 2016 Lenovo P50, which I still use today, currently with 48GB. Not that I've noticed any instability - it runs 24/7. Both laptops run Linux Mint 20.

As for Linus, I wonder why he didn't just remove the faulty DIMMs. HP, and Dell (perhaps others too), can tell you which DIMM is bad. I remember the pain involved when I used Supermicro years ago, which could not do that - tracking down which one was bad was annoying. I'd assume he has at least a half dozen DIMMs and could lose 2 or 3 (maybe having to remove a full bank) and still have a more functional system than a laptop.
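On Linux there's also the kernel's EDAC subsystem, which (when the memory controller driver is loaded) keeps per-memory-controller and per-DIMM/csrow counts of corrected and uncorrected errors, so you can often identify the bad module without vendor tooling. A rough sketch, assuming the edac-utils package is installed; the exact sysfs layout varies by kernel (newer kernels expose dimm* entries rather than csrow*):

edac-util -v                                              # per-controller / per-DIMM error counts
grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count    # raw corrected-error counts from sysfs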

VMware acknowledges the wisdom of never buying version 1.0 of a product

Nate Amsden

Not enough time

6 weeks between IA and GA? I would think 6 months would barely be enough time; 6 weeks is nothing. Here I am being super cautious, only upgrading to vSphere 7 after 6 went EOL (I did the same with vSphere 6 and vSphere 5). The issues that vSphere 7 had should show them that perhaps as long as a year between IA and GA is the minimum, unless they can otherwise prove their code is better.

Big changes coming in Debian 12: Some parts won't be FOSS

Nate Amsden

Re: Seems like a pragmatic idea

Arcane even for a Linux veteran.

Back in 2016 I replaced/upgraded from my old Toshiba Tecra A11, which ran Linux Mint 17, to a Lenovo P50 (still using it right now), which I also put Linux Mint 17 on; everything seemed fine. Wifi didn't work, but I didn't know that (yet) because I had no need for wifi at home - my laptop stays wired.

Fast forward 2-3 months..

I moved out of my home, put my stuff in storage. Was going on a 3 month trip(and moving to another city when I got back). Checked into a hotel near the airport and tried to get my laptop on Wifi. It did not work. I was confused.

After some research on my phone I determined that the default kernel in Mint 17 was too old, and that the Intel wifi drivers for my chipset only worked on a newer kernel. That just seemed absolutely insane to me, as someone who has run Linux on the desktop since 1998. I was running a still fully supported distro (at the time), with what was, in my opinion, a very conservative hardware configuration (chosen in large part due to Linux support).

I have no problem compiling or installing a 3rd party driver, but the Intel website that had the code made it clear the driver would not work on any kernel less than version X (I don't recall the version number exactly). After maybe an hour or so I managed to get all the software I needed downloaded to my phone and then transferred to my laptop over USB to get wifi working.

But even then it had issues, so I made a "wireless-fix" shell script which, from what I see, just ran "rmmod iwlmvm; rmmod iwlwifi; modprobe iwlwifi", and got wireless working again in a few seconds when it crapped out.
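The whole script was essentially just those three commands, something like this (shebang and comment added here for illustration):

#!/bin/sh
# reload the Intel wifi driver stack when the connection wedges
rmmod iwlmvm
rmmod iwlwifi
modprobe iwlwifi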

Since re-installing with Mint 20 a few months after it was released (I wanted a fully fresh/clean install; Mint 17 is still configured as another boot entry in grub) I haven't had to use that script, though my wireless usage has been pretty minimal since.

I've never advised newbies to use Debian on their laptop/desktop unless they are prepared for some extra work to get things going. As per the article, I myself stopped using original Debian on my desktops/laptops probably around the time Ubuntu 7 or 8 came out (I don't remember exactly), and switched to Mint (for MATE) after Ubuntu 10.04 LTS went EOL. I continued to use Debian on all of my personal servers up until Devuan came out, and then switched to it. At work my servers have been Ubuntu for the past 12 years now (I originally didn't like it, as before that I was always RHEL/CentOS on work servers, even though I was Debian/Ubuntu at home).

How Citrix dropped the ball on Xen ... according to Citrix

Nate Amsden

Citrix didn't push Xen hard at all

At least in my dealings with Citrix, and those I knew of at the time (this was ~8+ years ago), if Citrix came across a happy VMware customer they did not get pushy trying to get the customer to replace VMware with their Xen. I admired them for that - at least the reps that behaved that way anyway. They weren't going to push what was in their eyes (and mine!) a substandard product (compared to VMware anyway) on a customer and risk creating problems just to close the sale. But if the customer was absolutely interested in cutting costs at the expense of quality, then Citrix would happily entertain you with their Xen hypervisor offerings.

My time as a Citrix customer is limited to the past 10 and a half years, 99% of that being Netscaler. I have never used the Xen hypervisor, or any of their other products, other than a small deployment of XenApp(?) for several years - a single-server deployment on a special license that they deprecated almost immediately after I bought it. It ran great for its use case though (5 users max), just running a few various Windows utility apps (such as the Windows vSphere client remotely, and the older Java interface to Netscaler). I'd still have it now (despite it being out of support for years), but I had to retire it when I ditched my Windows domains a few years ago; as far as I could tell XenApp would not function outside of a domain environment, so I gave up. Fortunately the number of apps I wanted to use had dwindled a lot over the years.

I am a happy customer overall, though in the first several years I spent a lot of time with support working through some Netscaler issues (mostly Access Gateway client related, and what is/was called Citrix Datastream, their SQL load balancing).

AMD was right about chiplets, Intel's Gelsinger all but says

Nate Amsden

Re: RE:chiplets

Back in the 90s I got a Pentium Pro keychain from Intel. I still have it; it's not in the best of shape but it really shows the two "chiplets" quite well. I also got a big poster - maybe it was a calendar poster, I don't remember - of the ASCI Pentium Pro supercomputer? I don't have the poster anymore. I also have a couple of Pentium Pro CPUs; on at least one of them, the circuitry on the underside with the pins is protected by a layer of plastic or plastic-like material, and I cut most of that off after I got it to see the inside. Another of my PPros is ceramic(?) so I can't open it (easily anyway).

took a couple pictures to show: http://elreg.nateamsden.com/ppro.jpg