Lone sysadmin fingered for $462 MEEELLION Wall Street CRASH

On August 1st, 2012, high-frequency equities trader Knight Capital lost $US462 million after automated trading systems went haywire, a mess that has now been traced to a mistake by a single sysadmin. The incident saw the company place orders for many more shares than its clients wanted to buy. Knight ended up holding the baby, …

COMMENTS

This topic is closed for new posts.
Unhappy

Hindsight

Is, as always, a wonderful thing. My boss recently sent round a document describing how complex systems are usually running very close to the point of failure by their very nature. And how what looks like an obvious reason for failure after the fact is not usually so obvious beforehand.

Just glad it wasn't me though that did this one, would not look good on my CV....

5
1
Bronze badge

Re: Hindsight

Just because someone wrote a paper claiming something was true doesn't make it true.

I worked in a lot of shops in my day, which ended 4 years ago: three dozen of them.

Some shops were run close to failure, and the managers and workers there accepted that as normal industry practice.

Other shops strictly followed proper programming, testing and change control procedures even though that raised short term costs, and the managers and workers there accepted that as normal industry practice.

6
0
Bronze badge

Actually reading the findings, A to F on page 4 here the SEC report only blames management.

Actually reading the findings, A to F on page 4 here

http://www.sec.gov/litigation/admin/2013/34-70694.pdf

the SEC report only blames management.

So the Reg has it wrong, the Sysadmin made the error but the SEC says the cause was management failings.

12
0
Anonymous Coward

Re: Hindsight ??

Hindsight for a fingered sysadmin... Hummm.... :-)

0
0
Anonymous Coward

The sysadmin

It took one sysadmin to start the problem.

It took many managers, many board meetings and many months to create the toxic, lazy, risk-taking atmosphere to allow such a mistake to be made.

No wonder Knight Capital floundered and merged with another company. If they can't be trusted to roll out a new piece of software without losing their asses, how can they be trusted at all??

0
0
FAIL

Re: Hindsight

"Other shops strictly followed proper programming, testing and change control procedures even though that raised short term costs, and the managers and workers there accepted that as normal industry practice."

As I keep telling operational managers, it is Cost - >RISK< - Benefit Analysis.

Too many decision takers forget to factor in the risk costs.

Therefore the answer to the argument that it will take another month of testing and £200k is that this is cheap by comparison with a 10% chance of losing half our customers after a screw-up, and a 5% chance of being out of business.
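To put rough numbers on that argument, here is a minimal sketch of the expected-loss sum. Only the £200k testing cost and the 10% / 5% probabilities come from the comment; the two loss figures and the function name `expected_loss` are hypothetical placeholders, and the sketch assumes (optimistically) that the extra testing removes the risk entirely.

```python
def expected_loss(scenarios):
    """Sum probability * cost over (probability, cost) pairs."""
    return sum(probability * cost for probability, cost in scenarios)


# Only the 200k testing cost and the 10% / 5% probabilities come from the
# comment; both loss figures below are hypothetical placeholders.
TESTING_COST = 200_000
LOSS_HALF_CUSTOMERS = 5_000_000    # hypothetical cost of losing half the customers
LOSS_OUT_OF_BUSINESS = 50_000_000  # hypothetical cost of going out of business

risk_cost = expected_loss([
    (0.10, LOSS_HALF_CUSTOMERS),
    (0.05, LOSS_OUT_OF_BUSINESS),
])

# Extra testing is worth it when the risk it removes costs more than the testing.
worth_testing = risk_cost > TESTING_COST

print(f"expected loss without extra testing: {risk_cost:,.0f}")
print(f"extra testing justified: {worth_testing}")
```

Swap in your own loss estimates; the point is only that the risk term has to appear in the sum at all.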

This also occurs in equipment procurement, for example the Grob Tutor was asked for in black and yellow high viz paint by the head of RAF EFTS, turned down because it would add £500/aircraft in extra fuel over the life cycle of the aircraft. 2 mid air collisions later, 3 destroyed Grobs, 6 dead including one newly qualified (£3m+ training cost) later, that £500/aircraft looks remarkably cheap.

Some things are not hindsight; it was just a blindingly obvious risk before the event, skipped over because somebody wanted short-term cost savings (on which their bonus was probably based).

1
0

Re: Actually reading the findings, A to F on page 4 here the SEC report only blames management.

But as we all know from reading BOFH, while the sysadmin may be at fault, or should have understood the issue... he will always find a fall guy, and in good anarchist style, ensure that the fall guy is in management.

So the SEC sees only what the sysadmin wants them to see...

(and it always is management's fault.... anyway....)

0
0

No surprise...

So, lack of redundancy, probably because some accountant thought it cost too much.

Lack of testing, because a manager listened to the accountant.

Lack of training, because it detracts from profits.

But it's always the sysadmin's fault. Not the fault of the person who made the choice to not have someone double-check mission critical work. Not the fault of the person who made the decision to not test mission critical work. Not the fault of the company who saw no need to either train or pay enough for better trained people.

No, it's the sysadmin's fault. Of course.

Can you spell "s-c-a-p-e-g-o-a-t"? Shame on El Reg for not biting the hand that feeds you, but instead biting the hand of those who deliver the food.

11
1
Bronze badge

Re: No surprise...

In the management courses I took they told us staff mistakes virtually always have management mistakes as the root cause.

If a staff member makes an error that doesn't get caught until too late it is a management error in training or quality control, just like you say.

Consider airlines, a crash 'due to pilot error' is always viewed as poor training, poor rule enforcement, or some other management failure.

I suppose maybe the SEC investigator lacked management experience. Either that or the report was misinterpreted by the press.

6
0
Bronze badge

re-using an existing field is very risky too

Leaving old code in and re-purposing an existing field are both poor practices.

I can see still having old code that uses records, but they couldn't extend the record length? Storage space too expensive?

In mainframe days, my era, change control would have been simpler because you'd just have the one big server to worry about, not eight little ones.

7
0
Anonymous Coward

Re: re-using an existing field is very risky too

"In mainframe days, my era, change control would have been simpler because you'd just have the one big server to worry about, not eight little ones."

And if that failed you'd loose everything, not just a part of something.

1
4
Bronze badge
Holmes

Re: re-using an existing field is very risky too

I agree both responses and up voted them though both are only partial stories. I have seen old code come back to bite many times, it is a terrible land mine waiting for someone to step on the wrong place. Perhaps someone who was not even there when the mine was dug into the software.

What about those systems, often used in industry, where you do not have full access to the software? You delete a test setup, not knowing that the supplier does NOT delete the code but only disables the link, expecting that it will get overwritten 'sometime'. Then an accident causes it to become active when new work reaches a point that can trigger it, as in this case. Sadly, in the case I saw, no alarm was triggered; the operational system simply accepted the link and churned out no 'charge tickets' until the downstream billing system choked (by accident).

I totally agree that redundancy is not vital it is essential - as is heart beat and cross-checking rather than crass (not) checking, there is a difference. One is a systematic process and control function the other is a management failing.

2
0
Anonymous Coward

Re: re-using an existing field is very risky too

"In mainframe days, my era, change control would have been simpler because you'd just have the one big server to worry about, not eight little ones."

And if that failed you'd loose everything, not just a part of something.

Firstly, back in those days people knew how to spell lose!

But more to the point, if you only had one big server then you'd probably take a lot of care doing an upgrade like this. If you've got 8 little ones then you probably take lots of care upgrading the first one and checking everything works before rolling it out to the rest ... and by the time it's got to the 8th it's probably seen as a trivial installation that doesn't need to be checked.

12
0
Gold badge
FAIL

Re: re-using an existing field is very risky too

"And if that failed you'd loose everything, not just a part of something."

Don't know much about mainframe reliability do you?

Or their backup practices.

6
1
Big Brother

Re: mainframe days

Yes, the mainframe days, when a 30 minute outage in one year resulted in the head of IT having to write to the board to explain what had gone wrong, and how he was going to prevent it in future.

Average down time per server these days, now that the technology has "matured"?

"model office"/integration and UA testing concepts have not changed in the last few decades, it is just they are skipped over, for a more "agile" delivery and business operation.

PS

Note "agile" in this context is how the consultants and board room fad surfers use it, not agile system development and delivery done properly.

1
0
Anonymous Coward

Re: re-using an existing field is very risky too

One point of failure versus 8, plus the network connections and their configurations, plus the extra firewall rules and other security. Every patch to be done in development, test, user acceptance (hope with network connectivity tested too) and, finally, on eight live servers, none of which, in all probability, are really fault tolerant as they are cheap Linux boxes on what would once have been PC hardware or even, as in my last job, virtual systems sharing the physical box with all sorts of other systems/applications, all competing for possibly over-committed resources on the basis that not all VMs and their applications will need all the configured resources at once (of course, it happens and of course your DB or your application die, horribly - experienced it). And, of course, the VM or network or whatever are configured and managed by a specialist group to whom you can only specify your requirements and hope that it suits their strategy for meeting budget constraints.

Suddenly, one or two large servers on highly fault tolerant (power supply, memory boards, disc etc.) hardware and much simplified network connectivity with decent load capacity look interesting. Of course, they cost a lot initially; supplier support is probably not the cheapest. But now your critical service has fewer points of failure and software maintenance is a lot easier to plan, quicker to do, quicker to back out if necessary. As someone else pointed out, the standard uptime and performance of mainframes is in a different league from cheap, distributed systems.

I remember, following a high profile system failure at another firm, through poor procedures widely reported in the press, raising it with my management: it was in a foreign country, at a different bank and not understood as relevant or a chance to learn from the mistakes of others. Well, it would have needed effort and thought at a management and budgetary level. It also involved outsourcing to India; as that was and is the mandatory policy, it was very inconvenient so best ignored.

Lesson? Note any weak point discovered. Report them and keep a copy, with recipients, date and reaction or lack of. It will not help to get action. But it may provide protection later. If you are the SA put in this invidious position: no matter how self confident you are, make a written plan, get it reviewed and do it in anal detail, down to the full command lines or screen captures for each host and exact timings from the practice run you did on the test systems (didn't you?). Do not assume you will be in top form on the day or remember everything when the upgrade is running late or a network outage interrupts. Try and insist on a "four eyes" principle.

Yes, it's boring; it's not "agile" and it is not "clever". But it decreases the chance of disaster. Oh, and do monitor really closely for a full business cycle, whether that is an hour or a month, to detect problems and handle them before they reach disaster proportions (difficult in high volume transaction processing such as a trading environment; but try).
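One boring check that fits the written plan described above: before go-live, confirm that every host on the plan reports the expected release. The sketch below is only an illustration; the host names, version strings and function names are hypothetical, and how the per-host report gets collected is left out entirely.

```python
def find_stale_hosts(deployed, expected_version):
    """Return hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in deployed.items()
                  if version != expected_version)


def ready_for_go_live(deployed, expected_hosts, expected_version):
    """Check that every planned host reported in, and with the right version.

    Returns (ok, missing_hosts, stale_hosts).
    """
    missing = sorted(set(expected_hosts) - set(deployed))
    stale = find_stale_hosts(deployed, expected_version)
    return (not missing and not stale), missing, stale


# Hypothetical eight-host rollout in which the last host was skipped.
hosts = [f"trade{i:02d}" for i in range(1, 9)]
report = {host: "2.0" for host in hosts}
report["trade08"] = "1.9"

ok, missing, stale = ready_for_go_live(report, hosts, "2.0")
print("go-live allowed:", ok)
print("hosts not reporting:", missing)
print("hosts on the wrong version:", stale)
```

It is not a substitute for the second pair of eyes, but it gives the reviewer something concrete to sign off.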

1
0
Silver badge
Unhappy

I think the Reg's bye-line of blaming a lone sysadmin is a bit harsh on the sysadmin. Sure, the final error was that the sysadmin didn't do it right. But surely the larger error was the company not having better deployment procedures in place to reduce human error?

5
0
Silver badge

Just for info - byline is the line saying who the article is by. Strapline is probably a better term to use. HTH

1
0
Silver badge

That'll teach 'em

That'll teach the beancounters not to question the BOFH's procurement orders.

(Yup. You're right. It probably won't. Halon is more effective.)

2
0
Bronze badge

There is a very simple rule :-

"You can't check your own work"

You may think you can, but you can't.

If it's important, get a second or third check.

14
0
Silver badge
Flame

Play with fire, get burnt. T'was the sysadmin!!

Really, in that environment, they should make sure the stuff's quality is commensurate with the losses. To all evidence, this means "interplanetary robot / nuke plant" - level quality. Of course, this is not possible. No-one is gonna pay for that so procedures have holes, code is shoddy if not simply of unclear design written by overpaid Prima Donnas and, really, the environment is one of those which are hard to design for, hard to predict and not necessarily amenable to modeling, ergo a lot of "testing" will happen while in production.

DO NOT BE SURPRISED!

0
0
Bronze badge
Mushroom

Don't shoot the sysadmin, shoot the programmer!

re-using an obsolete message flag for a new purpose is simply lousy programming.

I've already seen (and been bitten by) this, as well.

Programmers of this world - bits are not so rare that you have to re-use them for new purposes. If you need to send new information, please use a new bit / flag / message, whatever. Oh, and make sure that you use the latest interface description and also add your changes to it. You DO make interface descriptions, don't you?

Reusing obsolete messages for new stuff just leads to a big mess and to conflicts that no compiler, linker or run-time check can ever find. Even a peer review will probably not save you (or more precisely your customer), as they tend to review only the module and not the interface. It's just asking for trouble.
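A minimal sketch of why a reused bit is invisible to every check, assuming a made-up message format: the bit values and flag names below are hypothetical and are not taken from Knight's system. Both builds decode the same message without error, they just disagree about what it means. With a freshly allocated bit, the old build can at least notice something it does not understand.

```python
# Hypothetical flag tables for two builds of the same service.
OLD_BUILD = {0x01: "URGENT", 0x02: "LEGACY_ALGO"}
NEW_BUILD_REUSED_BIT = {0x01: "URGENT", 0x02: "NEW_FEATURE"}  # bit 0x02 repurposed
NEW_BUILD_FRESH_BIT = {0x01: "URGENT", 0x02: "LEGACY_ALGO", 0x04: "NEW_FEATURE"}


def decode(flags, meanings):
    """Return the names of all known flags set in `flags`."""
    return [name for bit, name in sorted(meanings.items()) if flags & bit]


def unknown_bits(flags, meanings):
    """Return the bits set in `flags` that this build has no meaning for."""
    known = 0
    for bit in meanings:
        known |= bit
    return flags & ~known


# Reused bit: a message meant as NEW_FEATURE silently starts the legacy code
# on any server still running the old build.
message = 0x02
print(decode(message, NEW_BUILD_REUSED_BIT))
print(decode(message, OLD_BUILD))

# Fresh bit: the old build sees a bit it does not know and can refuse the message.
message = 0x04
print(decode(message, NEW_BUILD_FRESH_BIT))
print(hex(unknown_bits(message, OLD_BUILD)))
```

The second case only helps if the receiver actually rejects unknown bits, which is itself something the interface description has to say.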

IMHO it's the programmer who wrote this who should be crucified, not the poor sysadmin who stumbled over the mess others left for him.

Of course the programmer is probably sitting somewhere outside of US legislation (India / Vietnam / China, wherever) and is a bit difficult to use as scapegoat.

And of course people are used to paying for shitty software that disclaims any liability, sold by the usual big name software vendors, so they take it for granted that bad programming cannot be counted as the reason for such a disaster.

7
0
Bronze badge

Re: Don't shoot the sysadmin, shoot the programmer!

Bit and field re-use is a risky process, but we all know how the conversation went between the lead developer/manager and the programmer:

m) We need to add a new system flag..

p) Ok I'll mod the schema and add a new field..

m) NO If you do that we'll have to submit it all to change control and we've already agreed the budget for this

p) How could you agree a budget without knowing what the change will be?

m) Look, we've got a whole load of stuff left over from <insert-legacy-system-name-here>. I worked on that years ago; we can re-use that.

p) Sure, do you want me to scan any other systems that use this table?

m) No time, we've agreed a delivery date of Friday..JFDI

BTW M&P are not bad people, but as anyone who has worked in a financial firm will tell you, the culture is driven by the big beasts at the top who see anything or anyone coming between them and their bonus as scum to be sidelined or removed.

4
0

why was internal IT upgrading specialist software

Why didn't Knight get the actual trading software company to upgrade their software? Didn't want to pay for their support charges?

0
0
Bronze badge

Re: why was internal IT upgrading specialist software

My understanding is that it was their internal software.

1
0
Anonymous Coward

"Knight did not have a second technician review this deployment ... ".

So it was a management / procedural fcukup rather than a sysadmin fault. Peer review is standard practice at all the places I've sysadmin'd.

But hey, when you cancel the daily office hygiene-maintenance contract in favour of a 'self-service' CYOFM (clean your own fcuking mess) approach to save money, you can't blame the cleaning staff for unplugging the servers to plug in their vacuum. Handy to know the sysadmin is the next in line in the blame-chain.

1
0
Silver badge
Facepalm

And the point of the story is...

PEOPLE MAKE MISTAKES!

Sorry, that's not the point. The point of the story is:

YOU PUT CONTROLS IN PLACE TO MAKE SURE THAT THE MISTAKES PEOPLE MAKE ARE CAUGHT BEFORE THEY DO DAMAGE

Sadly the bean counters are unlikely to be reading this comment, and all they'll see from the media is "sysadmin made a mistake"

5
0
Gold badge
Unhappy

The "One bad apple." Of course. A favorite of police forces everywhere.

No, the truth is that something like this comes at the end of a very long line of corporate cultures that at best allow and at worst encourage this sort of f**k up.

And BTW when we talk "redundancy" there were eight servers this was rolled out to.

Wot, no automated roll out procedure?

BTW I think the classic legacy code FUBAR was the reuse of the Ariane 4 GNC code on the Ariane 5. The section that ultimately stuffed the maiden flight was a) not used at all in the A5 system (its functions had been superseded) and b) what it treated as an excess rate was well within A5 design, so it should not have been run and the problem it helped with did not exist in A5 by design.

2
0
Gold badge

Re: The "One bad apple." Of course. A favorite of police forces everywhere.

Since you started talking about redundancy, we also have to carry that on to the A4 / A5 example. In this case, the legacy code caused a problem that was spotted (an overflow converting a 64-bit floating-point value to a 16-bit integer), so the nominal computer was effectively sidelined, to allow the redundant one to take over.

The problem was that since the failure was inherent in the design, rather than the failure of a part, the redundant computer made the same error, and was also shut down.

People often forget what redundancy protects against.
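The point can be shown with a toy failover loop. This is only a sketch under stated assumptions: the 16-bit range check, the unit names and the function names are illustrative and are not the actual flight code. Because both units run the same conversion, an input that breaks the primary breaks the backup in exactly the same way.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16(value):
    """Convert a float to a 16-bit signed integer, failing loudly on overflow."""
    number = int(value)
    if not INT16_MIN <= number <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return number


def run_with_failover(units, reading):
    """Try each (name, convert) unit in turn; return (name, result, failed_units)."""
    failed = []
    for name, convert in units:
        try:
            return name, convert(reading), failed
        except OverflowError:
            failed.append(name)  # this unit is shut down, try the next one
    return None, None, failed


# Identical software on both units: redundancy against hardware faults only.
units = [("primary", to_int16), ("backup", to_int16)]

print(run_with_failover(units, 1200.0))   # within range, primary handles it
print(run_with_failover(units, 40000.0))  # design fault, both units fail
```

Redundant copies protect against one part breaking; only diversity (different code, or different assumptions) protects against the design being wrong.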

3
0
Silver badge

Blame apportioned wrong

The sysadmin was not to blame, reusing the PowerPeg flag was wrong. So it is the programmers' fault.

And I say that as a programmer myself.

1
0
Anonymous Coward

You'd be amazed at how many changes are made on the fly...

Having worked on HFT systems I have to comment. You would be amazed at how many changes are made on the fly without care and consideration to investors. Think FIAT: Fix It Again Tomorrow as a working mantra. Part of the problem is this. You can't recreate a live complex organic trading system in beta. Often the beta isn't even running the same release as the live system. So it isn't just a question of numbers i.e. only testing 10% of the orders. It's also a question of real-world complexity in the way orders are placed and in the complex interaction between all the different players.

Complexity is exponentially increased with automated market makers (AMM) and their interconnected exchanges, HFT systems, hardwired bank and institutional screens, retail systems, and legacy phone or pit orders etc. When you sprinkle in AMM stock-pinging, liquidity-rebate-trading, front-running, fat-finger trades, legitimate cancelled orders, and competing orders from co-located servers versus those at a distance... life can get very messy.

It's very difficult to build a good simulation. I wish the regulators would acknowledge this and herald it as a warning.... Instead they continue to see these events as one-off problems in tunnel vision fashion. So when is the next flash-crash or IPO non-event...? ...FIAT!

3
0

Re: You'd be amazed at how many changes are made on the fly...

Amen, I remember that in my coding days linking up to the OMLX and DTB. The beta markets were dead, had no prices in them, had no volatility, no rate of change, nowhere near the amount of derivatives listed, etc, etc, it was a nightmare. Plus, as you say, the way a market kicks off when the AMM kicked in. In the end the only solution was to stub the in/out of the exchange and record real data over a period and then play it into the system and compare the "new" system's behaviour to how it should have worked. Couldn't capture API problems to the exchange (hopefully the beta site did that) but at least we knew how our system would react to certain inputs from external systems. It sounds like this would have spotted that one of the systems wasn't giving back what it should have done in this case, but it costs money, it costs time, it delays things.
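The record-and-replay approach described above might look roughly like this. It is a sketch only: the tick format, the two toy decision functions and the deliberately planted regression are all hypothetical, standing in for a recorded exchange feed and two releases of a real system.

```python
def reference_system(tick):
    """Known-good behaviour: buy only when the price is strictly below the limit."""
    return "BUY" if tick["price"] < tick["limit"] else "HOLD"


def candidate_system(tick):
    """New release with a planted regression: it also buys at the limit."""
    return "BUY" if tick["price"] <= tick["limit"] else "HOLD"


def replay(recorded_ticks, reference, candidate):
    """Feed the same recorded inputs to both systems and collect disagreements."""
    mismatches = []
    for index, tick in enumerate(recorded_ticks):
        expected = reference(tick)
        actual = candidate(tick)
        if expected != actual:
            mismatches.append((index, tick, expected, actual))
    return mismatches


# Stand-in for data captured at the stubbed exchange interface.
recorded = [
    {"price": 99, "limit": 100},
    {"price": 100, "limit": 100},
    {"price": 101, "limit": 100},
]

for index, tick, expected, actual in replay(recorded, reference_system, candidate_system):
    print(f"tick {index} {tick}: expected {expected}, got {actual}")
```

As the comment says, this cannot catch problems in the exchange API itself, only how your own system reacts to inputs it has already seen in real life.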

Thank heavens I don't work in that industry any more as I really don't think I could stomach the JFDI attitudes knowing what was at stake if you make a mistake. Complete failure of management controls or risk assessment, sadly common across most industries.

1
0

Re: You'd be amazed at how many changes are made on the fly...

>It's very difficult to build a good simulation

You mean impossible.

You can't build a decent simulation of the market when so much of the current behavior of the market is tantamount to abuse of the system. When a new strategy is successful it can become the dominant behavior in HFT systems in a very short period of time, risk be damned.

0
0

Re: You'd be amazed at how many changes are made on the fly...

Good thing the complexity is all man made ... so it can be man-unmade with just enough right motivation.

0
0
Anonymous Coward

Lone SysAdmin?.... Or more assumptions being made in the name of maximizing profits

I learned a lot about the frailties of Algo Trading HFTs. Here are three reasons why Algo Trading / HFTs mess up. I was once working at a large bank for US interest rate traders. As an example, the traders wanted a mechanism to move large blocks of 30 year bonds without moving the market. So the goal was to design a black box (BB) to move small parcels throughout the day, trading on defined limits of the market at that time. The latter phrase proving crucial here!

The problem is: whenever it's a busy morning the BB is reliant on timely receipt of MESSAGES and the timely order of those messages from the exchange’s trading systems, which are themselves a kind of BB. And all of this while withstanding the usual glitches like network latency, bottlenecks, outages etc. What I discovered was :-

#1. You can't recreate a live market in Beta. Why not? A. Because the beta system is not anything near as liquid, B. the beta is often not even running the same Exchange software version, and C. your black box is playing with other robots-- and not humans and robots and real-world actualities.

#2. There are significant subtleties in the way a busy morning can affect the ORDER in which your black box receives messages into its queue! I found Cantor's e-Speed system sent messages out of order, making it tricky to pair up last trade prices with current market pricing. It meant trade confirms with the actual SIZE moved were delayed well beyond where the market was now. In short, my BB frequently found itself in an ill-defined state. What to do next...? Where was the market now? How much size was actually executed? What should the next BID / ASK posting be?

#3. The exchange provider only shares some technical subtleties with its clients, that is, unless you are one of its darlings that generate the most fees: hedge funds and Goldman, Deutsche et al... What this means is that you are often operating in the dark. In addition the exchange's systems can be the actual cause of the glitch. Without warning Cantor renumbered trader operator IDs overnight. Anyone who hadn’t re-logged on was unwittingly now using the wrong trader ID. Boy, that was messy! Traders getting another trader's confirms, trades executing in the wrong books etc.
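For the out-of-order problem in #2, one common defence is a resequencing buffer. The sketch below assumes the feed carries a per-message sequence number, which the comment does not say e-Speed provided, so treat that (and every name here) as a hypothetical. It holds back early arrivals until the gap is filled, and exposes the "ill-defined state" explicitly so the black box can stop quoting while messages are missing.

```python
class Resequencer:
    """Release messages strictly in sequence order, buffering early arrivals."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq
        self.pending = {}  # seq -> message, for messages that arrived early

    def push(self, seq, message):
        """Accept one message; return the messages now deliverable in order."""
        if seq < self.next_seq:
            return []  # duplicate or already-delivered message, ignore it
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def has_gap(self):
        """True while something is buffered behind a missing message."""
        return bool(self.pending)


feed = Resequencer()
print(feed.push(2, "confirm: sold 5"))  # arrives early, nothing released yet
print(feed.has_gap())                   # state is ill-defined, stop quoting
print(feed.push(1, "last trade 101.5")) # gap filled, both released in order
print(feed.has_gap())
```

A real version would also need a timeout for messages that never arrive, which is exactly the point where the risk decision has to be made by someone.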

0
0
Anonymous Coward

Back in the early 1980s

I worked for the Council of the Stock Exchange. We took backups of all the day's trades, every day, for use in integration testing (around an hour using 4 tape drives in parallel, making copies of the live tapes, for those interested). I realise that the environment is much more real time these days but it can't be beyond the wit of man to snapshot a position, then serialise the transactions by time over a period of time to simulate real-world trading with, ahem, real-world trading.

1
1
Anonymous Coward

'serialise the transactions by time over a period of time to simulate real-world trading'

The devil is in the detail AC08:17... You're like a regulator: you just don't get it!

Of course it's not difficult to do this, but it completely ignores the 'handshaking' and negotiation that is required to communicate with the exchanges. This is as much an accounting and risk issue as anything else. In a rapidly moving market, and let's be clear, that's the kind of volatility that HFTs and AMMs love to exploit, automated trading apps place orders and receive trade confirms in fractions of a second. And with co-located servers and ever faster cables being laid, things can only get more interesting. Trade confirms tell traders what their positions are by telling them where the market is right then. In a simple sense, they help traders decide whether to buy more or sell more or hold. But with machines these real-time checks are often skipped because of time pressures...

Let's go back to the Knightmare. There were 8 servers but one was rogue. They had clearly optimised the servers for raw speed, there was little time remaining for calculating positions or book risk, so few checks were in place. So the systems kept on trading with zero regard for the actual positions being accumulated on the books. That led to disaster! OK, so why didn't they tot up their positions? Because that takes CPU time, database access, quant library calculations and network latency etc. Whereas they wanted raw speed to beat out their competitors from other firms with competing AMMs and HFTs!
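For what "totting up positions" could mean at its very simplest, here is a sketch of a pre-trade position guard with a kill switch. The class, the limit and the quantities are hypothetical, and it counts exposure at the moment an order is sent rather than waiting for a confirm, which is an assumption made precisely because confirms can arrive late.

```python
class PositionGuard:
    """Block orders that would push net exposure past a limit, then stay halted."""

    def __init__(self, max_abs_position):
        self.max_abs_position = max_abs_position
        self.position = 0    # net exposure, counted when an order is sent
        self.halted = False  # once tripped, needs a human to reset

    def try_send(self, signed_qty):
        """Return True if the order may go out (positive = buy, negative = sell)."""
        if self.halted:
            return False
        if abs(self.position + signed_qty) > self.max_abs_position:
            self.halted = True  # kill switch: stop everything, not just this order
            return False
        self.position += signed_qty
        return True


guard = PositionGuard(max_abs_position=1000)
print(guard.try_send(600))   # within the limit
print(guard.try_send(600))   # would reach 1200, so the guard trips
print(guard.try_send(-100))  # still refused until someone investigates
print(guard.position, guard.halted)
```

The check is one addition and one comparison per order, so the cost argument is mostly about everything behind it (accurate positions, shared state across servers), not the check itself.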

Now let's look at a more general case. On a busy day where the exchange has hiccups, Trade Confirms can often be delayed. This is what happened with the Facebook IPO. Investors didn't know if they had gotten their share of shares. So they sweated over whether they should resubmit orders, risking over-subscription. What should they do, buy more, or hold off? It's another Knightmare in this scenario. As it happens many had been wrongly assured the price was going to skyrocket, so they resubmitted, compounding the existing lag!

In the posts above I detailed the complex interaction of getting timely 'messages' from the exchange. When you don't, you are lost at sea from a financial perspective. The code may be functioning correctly, it may have been bullet-proofed, but because it has no idea where the market is at that point in time, it is effectively lost. So this is a real-time accounting and risk valuation issue more than a tech issue. But at the speeds these programs are operating at, there is a potential for rapid freefall in the market meantime, and therefore risk of apocalypse. Moreover, there are orders sitting on the side-lines, waiting to be executed should a price point be hit, i.e. stop losses etc. If these are triggered, as they will be in times of high volatility, there will be further freefalls, and it will cause a chain of cascading price collapses.

I don't think the regulators have any idea what they're up against. If they passed a law requiring these firms to calculate their positions after every block of trades, it might alleviate some volatility, as it would force them to stop trading and wait for any delayed confirms. But I doubt this would ever be enough... Regulators should work at HFT firms for a time to have a better appreciation of this house of cards. As another poster pointed out, the JFDI attitude of these shops is compounding the problem too...

0
0

Re: 'serialise the transactions by time over a period of time to simulate real-world trading'

"I don't think the regulators have any idea what they're up against"

- I think this has been proven many times in the last 10 years

"If they passed a law requiring these firms to calculate their positions after every block of trades, it might alleviate some volatility, as it would force them to stop trading and wait for any delayed confirms."

- yes, they would have to stop their casino operations, and go back to being a boring old-fashioned financial institution that facilitates the raising of capital for economic growth, rather than being a gambling operation, not generating or contributing to the real economy

If you want to know why, after such spectacular screw-ups in the gambling halls of Wall Street and the Square Mile in the last decade, these operations are not being clamped down on by the regulators, take a look at your elected representatives' funding and election support services, and the revolving door (more like a high speed turbine these days) between politicians & regulators and top jobs in the banks, etc.

0
0
Anonymous Coward

If we haven't already learned this...

How long do you think it will be before your boss issues a new policy manual about how you work and who checks your activities?

I don't know - what's the weather forecast for Hell?

Certainly these types of severe, public errors often become lore, and are repeated more or less accurately (usually less) with some mix of gravity and schadenfreude around the industry. The Therac-25, Iran Air 655, and others cost lives; Knight Capital and the like led to huge financial losses; CHRISTMA EXEC and the Morris Worm tied up resources. I have half a dozen books, written for a non-technical audience, dissecting IT-spawned disasters.

But these incidents haven't led to sweeping changes in the industry, and there's no reason to believe that another one will have any different effect.

0
0