FYI: AI tools can unmask anonymous coders from their binary executables

Talk about the ultimate Git Blame. Programmers can be potentially identified from the low-level machine-code instructions in their software executables by AI-powered tools. That's according to boffins from Princeton University, Shiftleft, Drexel University, Sophos, and Braunschweig University of Technology, who have described …

Anonymous Coward

You got me, I'm the hacker that still uses goto and I never use OOP because procedural programming is much more 733t.

I would love to believe this is possible but an accuracy rate of 65 per cent is neither here nor there. It does work well when identifying state sponsors though.

6
0
Gold badge
Big Brother

"It does work well when identifying state sponsors though."

You have it backwards.

It's very good for states to find anyone writing code they don't like.

For those sorts of state a 35% failure rate is acceptable. *

*"Better a 100 innocent men are punished than a single guilty man escape" as a well known psychopath once put it.

12
0
Silver badge

I'm surprised

I'm surprised that it took this long, really. Anyone who's worked long enough with specific developers learns how easy it is to tell what code they've written based purely on their style. It's as unique as a fingerprint. It always reminded me a bit of the fact that in the old days of the telegraph, telegraph operators could identify each other based on their particular keying rhythms.

20
0

Re: I'm surprised

I would agree with you if we were talking about source code. But after passing through a compiler?

10
3
Silver badge

Re: I'm surprised

Much as handwriting analysts can judge how likely it is that a particular person wrote something from the "grown" characteristics of the writer (style traits developed as the person learned to write).

3
0
Silver badge

Re: I'm surprised

"I would agree with you if we were talking about source code. But after passing through a compiler?"

The compiler is still basically directed by the source code, so the end result is still going to preserve the essential coding style of the original writer. Code optimizations and code munging can change things some, but it's more like distorting a person's signature; the essential style characteristics embedded into the original code will still be there if you look carefully enough.

6
0
Silver badge
Meh

sample set is too small

Seriously, the sample set is too small. If they'd used THOUSANDS of coders [or better still, MILLIONS] and been able to get a 65% accuracy on determining "who wrote this", I'd be impressed.

And in the case of finding out who wrote an "illegal" program, this is what you'd have to be able to do.

No fear necessary.

10
1
Silver badge
Boffin

Questions questions...

1. How well does it work if the programming language is uncommon and can't be determined from the binary?

2. How well does it work if #1 is true & the programming language is Assembly?

3. How well does it work if #1 is true & the high level language compiler allows inline Assembly so the programmer can randomly jump between the language & Assembly?

5
0
Anonymous Coward

Re: Questions questions...

#4 and how dependent is the technique on the unique usage of libraries and other tools used by the programmer (versus the "programming style")

4
0
Bronze badge

Re: sample set is too small

That you are paranoid does not mean they are not out to get you.

On the other hand, after three days without coding, life becomes meaningless.

6
1
Silver badge

Re: I'm surprised

Yes, even after passing it through a compiler -- although you need a larger code sample to be able to tell with accuracy, because you're relying more on macro patterns than micro patterns.

0
0
Silver badge

Re: sample set is too small

If they really are out to get you, then you aren't paranoid -- you're correct.

1
0

Re: I'm surprised

"I'm surprised that it took this long"

Quite. IIRC something very similar was achieved in the world of regular books a few years back - reducing a given author's style to a digital fingerprint. A useful tool for proving the provenance of disputed authorship.

I see no real difference here.

1
0
Bronze badge

My code is VERY EASY to identify...you can actually READ IT and UNDERSTAND IT!

That is what happens when your first programming languages are PASCAL and COBOL...you have no other choice BUT to write easy-to-read and therefore very-identifiable code! I would be easily and QUICKLY found out by any investigator.

I know one C programmer who would NEVER be able to be identified by this method because NO-ONE except himself can read his code and the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD for embedded processors and specialty applications! He hasn't needed to work for the money since the early 1990s, but because all the "Big Boys" of industrial and consumer hardware want him for his superior speed-up expertise, he keeps amassing a very large fortune by writing the world's most UNREADABLE C code!

He knows every CPU optimization of every part of the various C compilers he uses down at the assembler code and register-usage level and there is NO WAY his type of code could be profiled at the assembler/binary level using "Stylistic Differentiation"...the optimizations he creates cause the compilers to output only the most basic and reduced instructions.

9
9
Silver badge

Have you ever thought that what you describe in itself is a coding style? Meaning he CAN be identified?

23
0
Bronze badge

TRUE! But I think that if he wanted to, he could just mask his optimizations and make his code look like generic output. Anyone who knows how to modify operating system and hardware driver assembler code WHILE IT'S RUNNING can mask his code-style traces to any level he so desires.

5
3
Silver badge

But that would be like altering one's handwriting to mimic another: unnatural to the practitioner due to force of habit. Plus I don't think there IS such a thing as "generic" code since most code is made by man, which means each snippet will have a style signature.

7
0
Silver badge

UN-altered REPRODUCTION and DISSEMINATION of this IMPORTANT Information is ENCOURAGED, ESPECIALLY to COMPUTER BULLETIN BOARDS.

3
1
Silver badge

"I know one C programmer who would NEVER be able to be identified by this method because NO-ONE except himself can read his code and the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD"

That sounds like it would be exceptionally easy to identify.

14
0
Anonymous Coward

"My code is VERY EASY to identify..."

My style analysis tells me he may be the same guy writing here, which led to more than a good laugh:

http://www.canonrumors.com/forum/index.php?topic=33975.0

We are waiting for his magnificent code to appear, so we can perform a style analysis on it.

2
0
Silver badge

My code is VERY EASY to identify...you can actually READ IT and UNDERSTAND IT!

I can THINK of ANOTHER reason that SOMEONE could POSSIBLY IDENTIFY your CODE.

7
0
Silver badge

"UN-altered REPRODUCTION...."

Uh oh, nobody talk about turkeys.

0
0
Silver badge

> the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD for embedded processors and specialty applications!

Maybe you could forward his CV to Intel. Heard they may be interested in someone who can work the fastest code around period.

1
0
Bronze badge

Re: "My code is VERY EASY to identify..."

Ya Got Me! --- I'm ONE AND THE SAME!!!! and YES my CODEC will be released very soon now. I do have a day job and my employer needs my expertise in video production and coding (everyone does everything at this company - i.e. Multi-tasking!) so I can only work on it in my off-hours.

Here is the basic outline of the code, which is MOST READABLE AND UNDERSTANDABLE, quite unlike my colleague's from years ago:

Threadsafe_Global_Constants:
    Maximum_Frame_Group_Length = 120;   { declared first: the array bounds below depend on it }

Threadsafe_Global_Variables:
    Final_Video_Output_Filename : Character_String_Type;
    Frame_Buffer_Images,
    Processed_Output_Images : Array[ ONE..Maximum_Frame_Group_Length ] of Bitmap_Image_Type;

Program_Begin
    Show_Destination_Output_File_Dialog( Final_Video_Output_Filename );
    Call_High_Resolution_Interrupt_Timer( ONE_HUNDRED_TWENTY_TIMES_PER_SECOND );
End_Program;

Define_CODEC_Procedures_and_Functions:

Procedure Interrupt_Timer_Event_Handler( Number_Of_Frames_In_Group: Signed_Integer_Type );
Var
    x, y,
    Frame_Number : Signed_Integer_Type;
Begin
    Try
        Keep_Within_Limits( Number_Of_Frames_In_Group, ONE, Maximum_Frame_Group_Length );
        for Frame_Number := ONE to Number_Of_Frames_In_Group do
        Begin
            Ingest_Current_Frame_From_Camera_Buffer( Frame_Buffer_Images[ Frame_Number ] );
            for y := ONE to Height_Of_Image do
                for x := ONE to Width_Of_Image do
                Begin
                    Process_Current_And_Neighbouring_Pixels( Frame_Buffer_Images[ Frame_Number ],
                                                             x, y, Frame_Number,
                                                             Processed_Output_Images[ Frame_Number ] );
                End;
            { write the group out once its last frame has been processed }
            if Frame_Number = Number_Of_Frames_In_Group then
                Save_Group_of_Frames( Final_Video_Output_Filename, Number_Of_Frames_In_Group );
        End;
        Stop_And_Exit_Compression_Program_Whenever_Main_Window_Is_Closed;
    Except
        Handle_Overflow_UnderFlow_NAN_Exceptions;
    End;
End;

Procedure Save_Group_of_Frames( Output_Filename: Character_String_Type;
                                Number_Of_Frames_In_Group: Signed_Integer_Type );
Var
    Frame_Number : Signed_Integer_Type;
    Destination_Video_File : File of Compressed_Video_Frame_Type;
Begin
    Try
        Open_File( Destination_Video_File, Output_Filename, APPEND_TO_END_OF_FILE );
        for Frame_Number := ONE to Number_Of_Frames_In_Group do
            Save_Compressed_Video_Frame_To_File( Processed_Output_Images[ Frame_Number ] );
        Close_File( Destination_Video_File );
    Except
        Handle_File_Exceptions_Here;
    End;
End;

Soooooooooo........Can you read and understand this????? If you can then I did my job!

0
0
Silver badge

Optimisations

We're often told that the tricks we learnt to get code to execute faster back in the days before good optimisation are worthless because a decent compiler will do that anyway.

This research gives the lie to that idea. Code written with DIY optimisations is substantially different from code written primarily for clarity. If this technique isn't defeated by compiler optimisations, then the optimisations are pretty unimpressive.

7
0
Silver badge

Re: Optimisations

Not necessarily. It could just be a "six of one, half a dozen of the other" thing: more than one way to get comparable results.

5
0
Silver badge

Re: Optimisations

Some of the older tricks execute more slowly on modern CPUs, because the balance has changed.

Eg: Lookup tables can now be slower than recalculating, because the table doesn't fit in a cache line but the calculation does.

Taking advantage of SSE and AVX is often faster than loop unrolling.

You can do any of these manually, but having done so, you probably won't revisit and change it when the balance changes and another optimisation technique becomes better.

2
0

Now, RISC code after full optimization might be harder.... That stuff is strange.

1
0
Silver badge
Pint

Tables, nearly Code-Free State Machines, and future Requirements Compilation

Once upon a time (early 1980s), there was a coding contest to see how much functionality could be crammed into one line of BASIC code; limited to about 240 characters. I arrived at a way to have a one line 'engine', and then as many subsequent DATA statements as you wish. With the extra DATA lines, it wasn't really a 'One Liner' winner, oh well.

Each DATA statement was conceptually a row in a table, and each row effectively encoded a machine 'state'. The data elements were: State ID#, assigned action or output data, then an extensible list of condition values with their next state ID#. The program inputs caused the engine to jump around the table based on those inputs, as designed and listed in the table.

Essentially all the states of the machine would be coded into a big dumb table, and the actual code was simply a very tight little loop.

It's a powerful concept, in applicable circumstances. Put your machine states into a trivial table format, automatically transcribe it in, and then add the one line engine. Done.

The same thing could be done in assembler. A wee tiny bit of actual code, and then a huge table making it sing. The big silly table could be prepared in MS-Excel, even by a manager.

It's a small step from the above concept to that (soon to be here) future of Requirements Compilation directly into code. Spec writers become coders.

This sort of Table Driven State Machine coding method is very nearly code free. In case that helps.
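For anyone who hasn't seen the pattern, here is a minimal sketch of the table-plus-tiny-engine idea in Python rather than BASIC. The turnstile states, inputs and action strings are invented purely for illustration; only the shape (a dumb table of rows and a very small loop) comes from the description above.

```python
# Each row: state id -> (action/output, {input value: next state id}).
# The turnstile states and inputs are hypothetical example data.
TABLE = {
    "LOCKED":   ("lock arm",    {"coin": "UNLOCKED", "push": "LOCKED"}),
    "UNLOCKED": ("release arm", {"coin": "UNLOCKED", "push": "LOCKED"}),
}


def run(table, start, inputs):
    """The whole 'engine': emit the row's action, then follow the transition."""
    state = start
    trace = [table[state][0]]
    for symbol in inputs:
        state = table[state][1][symbol]   # jump to the next row of the table
        trace.append(table[state][0])     # emit that row's assigned action
    return state, trace


final_state, actions = run(TABLE, "LOCKED", ["coin", "push", "push"])
print(final_state, actions)
```

Changing the machine's behaviour means editing TABLE only; the `run` loop never changes, which is the point being made about spreadsheet-prepared tables.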

6
0

Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

You just described how professors work. "Tables" = "machine code".

1
0
Silver badge

Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

I've always made extensive use of finite state machines as there was no way in hell I was going to let my code execute non-deterministically if at all possible. Breaking out or doing the unplanned was a ticket to, as I've said quite often, a federal prison should things blow up or people be harmed or killed. So, you've pretty much described my style, no matter what the tools.

As to stylometry, I've not a worry in the world. I've not got code out there at all accessible. Not that I want to go off the reservation. Quite the contrary. Still, reassuring.

2
2
Silver badge

Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

Did you intend "processors" rather than "professors"? If so, then yes, although I constrain, and validate like hell, to a limited set of instructions. Just me being me though. I think.

3
1
Anonymous Coward

Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

"I've always made extensive use of finite state machines [...]"

I remember reading a paper on finite state machines in 1974 when producing a spec for a protocol driver. My design produced an "engine" that depended on one instruction in the machine's code set that proved very efficient for that purpose. It was too novel for the person who did the development - who coded it in more linear fashion. He did acknowledge it was the most complete spec he had ever used.

Even without using FSM tables per se you can produce data driven code that is basically an "engine". Over the years I have used many of my designs for totally new purposes. Not the most efficient at run time - but quick to implement an enhancement or a new use.

2
0
Silver badge

Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

These are NOT finite state machines.

These are Turing Machines.

Which are interpreted by a lower-level Turing machine. With polynomial slowdown.

1
0
Anonymous Coward

MOV R0, #1

.loop

ADD R0, R0, #1

CMP R0, #100

BLE loop

Try de-anonymising that. You could probably analyze C, because everyone who writes it uses their own unique style. I doubt it's in any way practical with assembler, since there are only really two ways to write that: looped or unwound.

1
0
Silver badge

OK, exactly where does the snippet fit into the rest of the code? How does the code around it mesh with the loop? Do you use CMP #100/BLE, or CMP #101/BLT instead? Or perhaps start with MOV #100, DEC, and BNZ instead (to skip the CMP step)? Just saying there's more than one way to skin a processor.
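To make the "more than one way" point concrete, here is a rough Python simulation of three encodings of that loop: counting up with a less-than-or-equal test against 100, counting up with a strict less-than test against 101, and counting down to zero. The function names are made up for this sketch, and it models only the trip count, not real instruction timing.

```python
def count_up_ble():
    """MOV R0,#1 / ADD R0,R0,#1 / CMP R0,#100 / branch back while R0 <= 100."""
    r0, trips = 1, 0
    while True:
        r0 += 1
        trips += 1
        if not r0 <= 100:
            break
    return trips


def count_up_blt():
    """Same loop, but compared against 101 with a strict less-than branch."""
    r0, trips = 1, 0
    while True:
        r0 += 1
        trips += 1
        if not r0 < 101:
            break
    return trips


def count_down():
    """Start at 100, decrement, branch back while non-zero (no compare step)."""
    r0, trips = 100, 0
    while True:
        r0 -= 1
        trips += 1
        if r0 == 0:
            break
    return trips


print(count_up_ble(), count_up_blt(), count_down())  # prints 100 100 100
```

All three run the body the same number of times, so which one turns up in a binary is a matter of habit (or of the compiler), which is exactly the kind of choice a style fingerprint would feed on.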

2
0
Big Brother

a rent on life, middle-manning it with code

That's easy to avoid, just copy everyone else's code

it all does the same thing anyway and you're just wanting to seem indispensable.

a rent on life, middle-manning it with code

5
0
Anonymous Coward

Re: a rent on life, middle-manning it with code

When I did support programming I always used to imitate the style and intentions of the original author so that the change was seamless.

That took time to understand how the original code worked. Development colleagues often just grafted on a blister of code in their preferred style. Often fixing the symptom rather than the underlying problem.

1
1
Silver badge

Re: a rent on life, middle-manning it with code

"Often fixing the symptom rather than the underlying problem."

Apart from having just described Microsoft's model for the entire 1990s-2000s, in a lot of cases the underlying problem WAS the style and intent of the original author.

On more than one occasion the correct fix was to replace the code entirely.

0
0
Silver badge

Reproducibility

Take a look at the source code of theirs on Github. Decompiling binaries back to C is so messed up it's not funny. Sure, they got something. However, this is something that bears examination, and I really question what they did. How much picking and choosing did they do for their data sets? Did they throw out code that didn't reliably decompile? Because I have some stuff I'd like to see how the Snowman decompiler does on it.

Also, their "obfuscation" was a bit on the trivial side, using the llvm obfuscator.

I would like to see more work on this, and see if this is reproducible with different compilers, different options, etc. They seem to have tried one thing at a time, and not combinations.

4
0
Silver badge

It doesn't matter

What is important is to use the same decompiling toolchain with all the samples as well as the code-under-test.

This is looking for common patterns. It doesn't really matter what the patterns look like, only that they exist.

All you actually need is large samples of binaries of known provenance to compare against.

The technique is new, the theory isn't. Malware has been traced back to specific groups (named or otherwise) many times.
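As a toy illustration of "looking for common patterns" against samples of known provenance (and emphatically not the researchers' actual pipeline), one could count adjacent instruction pairs and pick the known author whose profile is closest. The mnemonic lists and author names below are invented; a real system would get its features from the same disassembler or decompiler run over every sample.

```python
import math
from collections import Counter


def bigram_profile(mnemonics):
    """Count adjacent pairs of instruction mnemonics."""
    return Counter(zip(mnemonics, mnemonics[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def attribute(unknown, known):
    """Return the known author whose profile is most similar to the unknown sample."""
    profile = bigram_profile(unknown)
    return max(known, key=lambda who: cosine(profile, bigram_profile(known[who])))


# Hypothetical reference samples of known provenance.
known = {
    "alice": ["mov", "add", "cmp", "ble", "mov", "add", "cmp", "ble"],
    "bob":   ["mov", "sub", "bne", "mov", "sub", "bne"],
}
print(attribute(["mov", "add", "cmp", "ble", "mov", "add"], known))
```

Note that nothing here cares what the patterns mean, only that they recur, which is why the consistency of the toolchain matters more than the quality of the decompiled output.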

0
0
TRT
Silver badge
Devil

Thank goodness...

I've only ever released StackOverflow copypasta.

12
0
Silver badge
Pirate

Re: Thank goodness...

Came to copy and paste this same comment. Left satisfied.

2
0
Anonymous Coward

"Another is using a different identity for every bit of code released."

Would that not reveal a common identity shared between the several identities? It would still provide a significant modus operandi that might then be correlated with another linking factor.

2
0
Silver badge
Trollface

Hmm...

So it can/might identify a coder from their repositories...

So for anything you release to the public, make it all your own work.

For anything you don't want tracing back to you use a good spread of copypasta.

Or just use off the shelf stuff and let the blame fall elsewhere.

2
0
Anonymous Coward

And the takeaway is

Pay someone else to write your malware.

0
0
Silver badge

So how is this different to other kinds of stylometry...

... which you can easily get around by just writing code in another style. Having stylometry even allows you to modify your code gradually so it'll look like code from someone else.

0
0
Anonymous Coward

Variation Space, a bit like Address Space

They'll have identified some characteristics, each with several possible values (style used). These together define the maximum possible "Address Space" of this identification scheme. E.g. 10 characteristics with 4 choices each is equivalent to 20 bits, or one in a million.

But then they'd need to account for 'Bell curves' where the possible values are not evenly used. Then they'd also have to account for the correlation across characteristics. The effective Address Space will be a fairly small fraction of the theoretical space. Probably an order of magnitude, maybe two orders, effectively smaller. E.g. ballpark one in 30k.
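The arithmetic above can be sketched as follows. The uniform case reproduces the 20-bit figure; the skewed distribution is an assumption picked only to show how uneven usage shrinks the space, and correlation between characteristics (which would shrink it further) is not modelled at all.

```python
import math


def uniform_bits(characteristics, choices):
    """Bits of identifying information if every choice is equally likely."""
    return characteristics * math.log2(choices)


def entropy_bits(probabilities):
    """Shannon entropy of one characteristic with unevenly used choices."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(uniform_bits(10, 4))   # 20.0 bits
print(4 ** 10)               # 1048576 combinations, i.e. about one in a million

# Assumed 'Bell curve' usage: one style choice dominates (hypothetical numbers).
skewed = [0.7, 0.1, 0.1, 0.1]
effective_bits = 10 * entropy_bits(skewed)
print(effective_bits, 2 ** effective_bits)
```

With these made-up numbers the effective space falls from 20 bits to under 14, i.e. roughly two orders of magnitude fewer distinguishable coders, before any correlation is taken into account.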

These are just the extremely basic Address Space considerations. What about: Noise, Deception, Unknown Libraries, Obfuscation, Misunderstood Processes, Copying, Sample Code, etc. ?

Although not up to Evidence standards, it might have some value as an Investigative Tool, but positive or negative value? Positive value to the actual perpetrator who was able to frame someone else by copying their code and mimicking their style for the changes?

There's a growing pile of discredited "forensic sciences" (sic). I suspect that a space should be reserved for this one.

0
2
Anonymous Coward

Re: Variation Space, a bit like Address Space

The Address Space conceptual analysis described above is perfectly sound. Although somewhat trivial, it is a missed step far too often. The conclusions follow directly from it.

Their only escape from this critique is if they have a large library of such coding-fingerprint characteristics, and have already accounted for the Bell curve and internal correlation limiters.

Downvotes without explanation are a bit pointless.

0
0

StackOverflow

Some guy on stackoverflow.com is going to have a lot of code attributed to him...

8
0

Biting the hand that feeds IT © 1998–2018