How do you mess that one up?
The query explicitly states that it will return one row at most!
How does a memory optimisation ignore the explicit limits set in a query?
Microsoft techies have pinpointed the SQL query that smashed the software giant's Visual Studio Team Services offline for several hours. It appears the query triggered a crippling SQL Server 2014 bug that Redmond is now trying to fix. Last week, Microsoft's cloud-based Team Services suffered a five-hour outage, leaving …
The query may return one row but that does not mean that intermediate stages consist of one row. If it was constrained to processing only one row at a time in memory, it would probably become extremely slow.
This problem of wanting to return one row but having to process an awful lot of rows to get it is extremely common, and it frequently causes memory issues. The question for me is whether the schema or views need changing in some way to avoid the issue.
In my admittedly crap, hackish experience with SQL, if you're doing TOP 1 or LIMIT 1 or whatever then you're probably doing it wrong anyway, because you've got a load of different results for the same query and then you throw all but one away. The one you're left with might even be the right answer.
Yes I agree, it is definitely a warning sign. If there is no way to write the SQL to do what you want without using TOP 1, then you need to have a look at the schema and maybe redesign.
But I guess they say write first, optimize later if needed, which is probably what happened here.
"But I guess they say write first, optimize later if needed"
That's supposed to happen in development, not in large scale customer facing production code!
But then I have been asked why I needed a full 16Gbytes and a replica of a production database for query testing. Aren't a few thousand rows and a VM enough?
"Especially since I can't see an ORDER BY here"
The query is supposed to return one row, so an ORDER BY on the output is nonfunctional.
There doesn't seem to be any table where ordering would be likely to short circuit the process, so adding any ordering will just make the whole query bigger and more expensive.
A) If you're sure your query will *always* return one row only, then TOP is useless; and if for any reason an unexpected result is returned, it's better to see an error/exception than to blindly take the TOP one even if it is the wrong one, leading to data corruption.
B) Your query returns more than one row and you need the TOP one: unless you enforce the retrieval order in some way, any change in the way rows are returned will lead to a different result. This can be a particular issue with non-rule-based optimizers, which may still change the access path even when hints suggest another. IMHO relying blindly on the "actual implementation" is just too risky.
"IMHO relying blindly on the "actual implementation" is just too risky."
It is not "risky" it is wrong! And when it breaks, the shit will start flying.
I get so tired of explaining this to morons; one despairs! Just because it "works" does not mean it is semantically or theoretically correct. The amount of "dodgy" code I have fixed in the SQL Server environment, where idiots simply assume order because there is an index that just happens to get chosen by the optimizer, is very large indeed.
Without an ORDER BY, there is NO ORDER in the returned results, even if it looks like that at first flush. Learning some relational theory would be a good thing for a very large percentage of the world's programmers and sadly a lot of DBAs :(
Except in SQL Server, the ORDER BY is also used to determine which rows to include in TOP. In this case, it wouldn't be used for presentation sorting, it would be used to guarantee predictable results (e.g. always the *same* TOP 1, if the data hasn't changed).
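A quick illustration of the difference (table and column names are guessed from the index name quoted further down the thread, so treat them as assumptions):

```sql
-- Without ORDER BY, TOP (1) keeps whichever row the chosen plan happens
-- to produce first; a different plan can hand back a different row
SELECT TOP (1) AccountName
FROM tbl_Identity
WHERE TypeId = 2;

-- With ORDER BY, TOP is well defined: the same row every time
-- for the same data
SELECT TOP (1) AccountName
FROM tbl_Identity
WHERE TypeId = 2
ORDER BY AccountName;
```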
"The query explicitly states that it will return one row at most!"
Although the aggregate function will _return_ one result, the dataserver has to first produce a potentially much larger result set which is then fed through the aggregate function to derive that result.
The optimiser has to deal with the entire query plan, not just that final aggregation step.
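A hypothetical shape makes the point (both table names invented for illustration): the statement below returns exactly one row, but the join it aggregates over can be enormous.

```sql
-- One row out, but the engine first has to stream through (or
-- materialise) every matching join row before MAX can collapse them
SELECT MAX(i.Id)
FROM tbl_Identity AS i
JOIN tbl_Membership AS m ON m.IdentityId = i.Id;
```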
Not really; you'll have the same problem on Postgres at least. Some of the new DBs are really awful. Amazon Redshift, for example, dies like this (as in full cluster reboot, come back in 15 mins) on anything with vaguely inaccurate statistics, the various Hadoop-based SQL engines just abort, and MySQL just can't do big tables at all.
Oracle and SQL Server have amazing query optimisers. SQL Server often has these issues in new releases and they get patched, 2005 was awful at the beginning but turned into a very stable database after sp1. I'm not a big Oracle user but I imagine that they have similar issues.
Postgres actually doesn't allow query optimization hints. It's one of the things I like about it.
Oracle I have never liked. IMO, there are two databases that should be used outside of special cases, and these are Postgres and SQL Server.
Window cleaner and Surface Expert brigade are apparently in full swing this morning, keep 'em coming, the downvotes.
Everybody on here knows that MS software cannot scale. You will notice that they "fixed" the issue by using a hint that allocates 1% of the 4Gb their "optimizer" (ROFL) thought was required, for one row, serious fail, if you ask me ... the fix is not good enough for me, I'm pretty sure 0.002% would suffice.
Don't ever let Windows cleaner and Surface Experts near your RDBMS, I tell ya!
Now, fanboyz, anything to comment?
Actually, one of our clients is a mega bank, you know, a bank that comes from some European country well known for chocolate, cheese, precision chronographs (or whatever they call watches this week), and banks ... they happen to use our software to handle financial transactions ... the data is in Oracle Database clusters, now, the dbs are actually "quite" large, the SQL Server database that fell over here is nowhere near that size, seriously, not even in the same universe. And the thing puffs along quite happily, not a database performance issue in years, afaik, and, they would knock on my door if anything goes wrong.
There is no way in hell you could run that off SQL Server, they would laugh you out of the building.
SQL Server has, and always has had, a brain-dead optimizer - especially with non-trivial table cardinalities. I did not notice many instances where this improved in SQL 2014, though I suspect there were improvements.
The bottom line is that big tables pose problems, as the optimizer makes assumptions that go seriously wrong. This is partially due to "statistics drift", as automatically updated statistics, aren't, due to the "black box" algorithm that is hard coded. The problem gets out of hand when large numbers of physical partitions are involved.
There are many simple cases that optimize so poorly that it may choose to scan the table in obviously wrong situations. A simple select of a row using the primary key can sometimes result in a table scan, which is always wrong in a table with more than a few pages' worth of rows.
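For what it's worth, SQL Server does let you make that failure mode loud instead of silent; a sketch with a hypothetical table and parameter:

```sql
-- FORCESEEK refuses plans that scan: if no seek on a suitable index is
-- possible, the statement errors out instead of quietly scanning the table
SELECT Id, AccountName
FROM tbl_Identity WITH (FORCESEEK)
WHERE Id = @id;
```

Better an error in testing than a surprise scan in production.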
Just limit the maximum size returned by the predictor to be 100MB (or a per database defined limit) - this will slow down some true large queries but will stop the system from crashing. For a more permanent fix add an option HUGE_QUERY that overrides the limit for the few queries that require a massive amount of RAM.
These fixes would only require a tiny amount of additional code and no changes to the body of the predictor which would reduce the chance of new bugs being introduced as a result of the fix.
Simple problem is that customer A doesn't want their queries slowed because of a problem they don't face with their usage pattern. You don't need to be very imaginative to consider a big data analytics required by a small team of researchers for whom allocation of 10s of GB per user is an average day and no problem. They would be hit for 6 with such a change.
It isn't just the slowdown either. SQL server uses locking concurrency control* (usually). Even very small decreases in performance can exponentially increase problems with deadlocks.
* as opposed to MVCC / MGA etc
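One mitigation on the SQL Server side is to opt into row versioning so readers stop blocking writers; a one-line sketch (the database name is made up):

```sql
-- Readers see a versioned snapshot instead of taking shared locks,
-- which removes a whole class of reader/writer deadlocks
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;
```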
Actually knowing your vendor will be upfront and wear their mea culpa in public rather than behind legalese (cough vtech) means that it is a culture of learning from your mistakes rather than shoving it under the blanket or finding some rogue engineers to blame (cough VW).
Now here is where I should feel smug for internally housing our SVN but I suspect that any outage we may suffer will be quickly blamed on some SAN or switch or someone else.
I've been working around query optimizers for a long time and one of the things I've learned is that they are rocket science. They are incredibly difficult to get right, and a small change that improves one query can affect others negatively. That's part of why SQL Server gives you the ability to control (through trace flags) which query optimizer compat level you are using. You should never change the compat level on a large scale database without testing.

As I said in my post, the responsibility for this outage is mine, not SQL's. Yes, their new cardinality estimator misestimated this query. They will tune it to fix this query while keeping the millions of others working well. This is the business of query optimization. I view part of my job as a large scale Microsoft service as providing a great test case for our platform services.

As some of you have surmised, the issue is with intermediates, not with the result. So the top 1 alone isn't enough to tell the estimator the answer - I should have mentioned that in my post. I truly believe there's no incompetence on the SQL Server side - and I'm not just saying that because I work at Microsoft. Those folks do some amazing things and I'm grateful for it.
I just want to say, that the openness that Microsoft has shown in explaining this problem, how it occurred, how it was addressed, has scored major points with me. As someone else said up-thread, it shows a culture of learning from mistakes and I see that as an extremely positive thing in a company. After all, everyone makes mistakes - that's mandatory. Learning from / being open about them - that's optional.
You have clearly never investigated how many broken optimisations the optimizer comes up with, including many that simply give the wrong results.
Microsoft just ignores them, and has been ignoring them for quite some time. I expect they will continue to ignore them.
In essence, they don't care, because they simply do not understand how critical a "wrong answer" optimization is. Oracle, on the other hand, is usually pretty snappy fixing "wrong result" types of error.
Much as I do not like classic Oracle (I love Oracle Rdb of course), they at least understand data, and they can deal with very large databases far better than most imagine in their "red mist" dislike of LE.
Sometimes the estimator can get things wrong. Often it's because statistics are out of date or missing and it has no way to gather a correct estimate. Usually memory grant problems go the other way and the issue is that not enough memory is granted.
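The usual first aid when the estimator is working from stale numbers is to rebuild the statistics by hand (table name borrowed from the quoted hints, so an assumption):

```sql
-- Recompute distribution statistics from every row rather than a sample,
-- giving the estimator current row counts to plan against
UPDATE STATISTICS dbo.tbl_Identity WITH FULLSCAN;
```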
The person who wrote (or "fixed") this particular query has taken matters into his or her own hands by including many hints which force the optimiser to take a specific approach when generating a plan.
INNER LOOP JOIN
WITH (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK)
OPTION (OPTIMIZE FOR (@partitionId UNKNOWN))
Using query hints to boss around the optimizer is widely considered to be fraught with danger. From the use of these hints in the first place it's clear that there have been performance issues with this query before this problem cropped up. The solution might be as simple as adding an appropriate covering index. It's also difficult to guess what's going on with the temp table #identities so we're missing the full picture. The use of MAX_GRANT_PERCENT = 1 seems like just another nasty hack and not the final solution.
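For context, the MAX_GRANT_PERCENT hack looks something like this (query shape and names invented; the hint itself exists in SQL Server 2012 SP3 / 2014 SP2 and later):

```sql
-- Caps this statement's memory grant at 1% of the maximum grant size,
-- stopping one misestimated query from starving the whole server
SELECT TOP (1) i.Id
FROM dbo.tbl_Identity AS i
WHERE i.TypeId = @typeId
OPTION (MAX_GRANT_PERCENT = 1);
```

It papers over the bad estimate rather than fixing it, which is exactly why it smells like a hack.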
The presence of these hints should raise a red flag during code review and should definitely be reviewed before upgrading to a new optimizer.
IMHO this should be considered a quality process fail for the VSTS team rather than an SQL Server problem.
I agree. The code expends more effort bending the optimizer than returning the result. As a DBA at my last client, I spent an inordinate amount of time unwinding this sort of crap (and much worse).
There is a reason that data people who know how databases *actually* work and how to design them logically and physically get paid well - we are few and far between, and the number of "programmer designed" databases seems to be growing exponentially.
SQL wants 4GB of RAM? Tut, tut. No ambition.
I've just farewelled Opera which, in its long-term version 12.xxx for Linux, would after a period open on the desktop and having stealthily taken over just about all the RAM available, make a mighty asset-grab and lay claim to 16777216 TERABYTES of virtual memory, freezing everything else for an undefined period of time (at least days, but I couldn't spare my laptop from active service for a few centuries while waiting to see if it would eventually blink).
Footnote: Opera continued a development path on Windows and eventually picked up Linux again at about version 34. Unfortunately this is missing the integrated e-mailer which is what kept me faithful for many years, so I'm now free to mix-and-match browsers and e-mailers like all the poor unwashed who never knew the benefits of great design. Only thing that keeps me away from Chrome is Google.
Biting the hand that feeds IT © 1998–2019