It looks like each site was only tested once. There were seven measurements for 30 seconds after the page was loaded, but as far as I could find each page was loaded only one time. So it's entirely possible that a background task, like Spotlight indexing or an hourly job, skewed the result for the NYT and made it look like the page's code was at fault. Running the test five or ten times on a freshly loaded page (ideally restarting the browser and loading the page from a blank page each time) would smooth out those events. I'm not saying that it isn't the NYT, but with a single data point it is hard to draw conclusions.
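As a rough illustration of why repeated runs help, here is a minimal Python sketch. The readings are made-up numbers in arbitrary units, not data from the article, and `summarize` is just a hypothetical helper name. It shows how one run that happened to coincide with a background task pulls the mean up while the median stays close to the typical value.

```python
import statistics

def summarize(samples):
    """Return mean, median, and sample standard deviation for repeated runs."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        # stdev needs at least two samples; a single run tells us nothing about spread.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else None,
    }

# Hypothetical readings: one run (42.0) coincided with a background task.
single_run = [42.0]
repeated_runs = [12.0, 11.0, 42.0, 13.0, 12.0]

print(summarize(single_run))     # one data point: no way to spot the outlier
print(summarize(repeated_runs))  # median stays near the typical value
```

With only the single run, the 42.0 reading would be taken at face value. With five runs, the outlier is obvious and the median is barely affected by it.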
While there was a good effort to stop other processes from running in the background, such as turning off Time Machine, there are many behind-the-scenes tasks and jobs that can't be turned off and can affect the results.
It might have been better to start the Mac in "safe" mode, where a lot of these extensions aren't started. Then create a new profile in Firefox with no extensions and have it start up with a blank page. For each page to test, start the browser, go to the page, wait until it loads, run the measurements, and quit the browser. Repeat the test until you get your five or ten samples. Move on to the next page and test it in the same manner.
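The procedure above could be sketched as a simple loop. This is only an outline under assumptions: `run_trials` and `measure_fresh_load` are hypothetical names, and the real `measure_fresh_load` would have to start the browser with the clean profile, load the page, wait, take the measurement, and quit the browser. Here a stand-in function is used so the loop itself can be run.

```python
from typing import Callable, Dict, List

def run_trials(
    urls: List[str],
    measure_fresh_load: Callable[[str], float],
    repeats: int = 5,
) -> Dict[str, List[float]]:
    """Collect `repeats` samples per page, finishing one page before the next."""
    results: Dict[str, List[float]] = {}
    for url in urls:
        samples: List[float] = []
        for _ in range(repeats):
            # Each call is expected to cover one full cycle:
            # start browser, load page, wait, measure, quit browser.
            samples.append(measure_fresh_load(url))
        results[url] = samples
    return results

def fake_measure(url: str) -> float:
    """Stand-in for a real measurement (assumption): returns a dummy value."""
    return float(len(url))

results = run_trials(["a.example", "bb.example"], fake_measure, repeats=3)
print(results)
```

Passing the measurement in as a function keeps the repeat-and-collect logic separate from however the browser is actually driven.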