Your anonymous code contributions probably aren't: boffins

There's no such thing as an anonymous programmer: your coding style can unmask you, according to research led by Drexel University Comp. Sci. PhD student Aylin Caliskan-Islam. In work that has serious implications for anyone believing their open source project contributions are anonymous, the researchers find that as many as …

  1. BongoJoe

    Exactly

    If it's code without any strong Type on the web then it's most likely examples of code from Microsoft itself.

  2. ChrisM

    Not a new concept

    Morse code operators could be identified by their distinctive 'fist', people can be identified by their handwriting, you can tell a person who rides from their riding style before you can make out details of their face or by their gait when walking towards you.

    1. AbortRetryFail

      Re: Not a new concept

      I was just about to post a near-identical post but you beat me to it.

      1. oolor
        Black Helicopters

        Re: Not a new concept

        Morse code perhaps, but:

        >Perhaps a programmer has a preference for spaces over tabs, or while loops over for loops...

        Those bastard 3 letters are on to me, and I haven't even thought the thought crime yet.

      2. Anonymous Coward

        Re: Not a new concept

        > I was just about to post a near-identical post but you beat me to it.

        And you beat me to being beaten to it.

    2. Dom 3

      Re: Not a new concept

      What's new is doing it *computationally*. I can identify code as mine / not-mine long after I've forgotten anything about it.
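As a rough illustration of what "doing it computationally" could mean, here is a minimal sketch that counts a few surface habits of the kind the article mentions (tabs versus spaces, while loops versus for loops). The function name `style_features` and the three features are hypothetical choices for this example; nothing in the thread describes the researchers' actual feature set, which reportedly goes well beyond layout.

```python
import re
from collections import Counter


def style_features(source: str) -> dict:
    """Count a few surface habits in C-like source text (illustrative only)."""
    lines = source.splitlines()
    # Lines that begin with whitespace, i.e. lines that are indented at all.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    tab_lines = sum(1 for ln in indented if ln.startswith("\t"))
    # Whole-word keyword counts; comments and strings are NOT excluded here.
    words = Counter(re.findall(r"\b(?:for|while)\b", source))
    loops = words["for"] + words["while"]
    return {
        "tab_indent_ratio": tab_lines / len(indented) if indented else 0.0,
        "while_ratio": words["while"] / loops if loops else 0.0,
        "mean_line_length": sum(map(len, lines)) / len(lines) if lines else 0.0,
    }


# Two tiny made-up samples with opposite habits.
sample_a = "int main() {\n\tint i = 0;\n\twhile (i < 3) {\n\t\ti++;\n\t}\n}\n"
sample_b = "int main() {\n    for (int i = 0; i < 3; i++) {\n    }\n}\n"

features_a = style_features(sample_a)
features_b = style_features(sample_b)
print(features_a)
print(features_b)
```

A real classifier would need far more features and many samples per author; the point is only that habits a human spots by eye reduce to numbers a machine can compare.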

    3. This post has been deleted by its author

    4. Version 1.0 Silver badge

      Re: Not a new concept

      Right - I've been identified walking around the office simply by the sound I make walking down the corridor regardless of which pair of shoes I wear.

  3. AndrueC Silver badge
    Meh

    "The mainstream idea is that better programmers write shorter and cleaner code which contradicts with line of code statistics"

    It depends what is meant by 'cleaner' in this instance. Introducing an unneeded variable might be considered 'unclean' but it could improve readability and any halfway decent compiler will optimise it out. In my experience short and concise code is harder to read and by trying to be too clever people are more prone to making mistakes.

    Unless you're doing embedded coding you can pretty much rely on the compiler to generate better code anyway (especially with languages like C# and Java) so clarity of source is more important than keeping things short.

    1. Stephen Booth

      Productive -> Better

      Note that the metric used to identify "Better programmers" was productivity. Productive programmers will also spend more time (and lines of code) on edge cases, error handling and re-usable encapsulations because these save time in the long run.

  4. jake Silver badge

    C++ ...

    ... there's your problem. Too much room for personality.

    K&R C and assembler tell the hardware what to do, and don't leave much room for personality to stand out ... at least not when done right. And the result is a hell of a lot faster (and arguably safer from a security standpoint) than code written in C++.

    (Before you try to argue with me, ask Cupertino what their kernel is written in.)

    1. Dan 55 Silver badge

      Re: C++ ...

      RAII says you're wrong.

      Maybe a kernel requires C because it acts as the intermediary between hardware and the rest of the system and it's better to stick with tried and tested code than change everything to C++ just for the sake of it, but the average program does not.

      1. jake Silver badge

        @Dan 55 (was: Re: C++ ...)

        "the average program does not."

        There's another problem. I was talking about fast, secure code. Not average code. See the difference?

        1. Kristian Walsh Silver badge

          Re: @Dan 55 (was: C++ ...)

          Yes, BSD's kernel is written in C.

          If Apple had acquired Be rather than NeXT in 1997 (i.e, if they had made their decision on purely technical grounds, rather than on who the CEO of NeXT was), then their kernel would now be written in C++.

          (Haiku, and it's now Open Source, if anyone is curious)

          As it is, Cupertino's current device driver and I/O layer is written in C++, and so are many of the low-level libraries unique to OS X. The remainder are C, or Objective-C for less performance-critical ones.

          This illustrates only that competent developers use a variety of tools. OS X was not a clean-sheet design; it was actually something of a rush-job as Apple had fallen far behind Sun and Microsoft in OS capabilities and desperately needed to catch up: you have to remember that even in 2000, Mac OS was only co-operatively multi-tasked; so one bad application could often kill your entire system. OS X as released was an amalgamation of many different sources: each was chosen because it was a proven, viable subsystem, not because it was written in the One Holy Language.

          But, going back to kernels: The reason why the BSD kernel is written in C is because AT&T's UNIX kernel was written in C, and that was because C was the language that K&R developed specifically to allow their UNIX OS to be portable across AT&T's various system architectures.

          1. AndrueC Silver badge

            Re: @Dan 55 (was: C++ ...)

            FWIW I believe that the Windows kernel is also written in C but its data structures are objects so in that sense it is object orientated.

            1. wikkity

              Re: object orientated

              You can write OO code in any language if you are perverted enough.

              1. Kristian Walsh Silver badge

                Re: object orientated

                "You can write OO code in any language if you are perverted enough."

                Ah, you've used glib GObject, then...

                At this point, I'll have to confess to writing quite a bit of OO code in 68000 assembly, although I didn't recognise it as such at the time. (I even had vtables)

                @Dan55 above on C being "faster". It's not. C++ code runs exactly as fast as the equivalent C code - C++ actually offers a good compiler more optimisation opportunities. Non-virtual method calls are simply C function calls, in-function variables are allocated on stack just as in C, and exceptions/RTTI can be disabled if your module doesn't require them. (Just specifying throw(); at the end of your method declaration removes the overhead of exceptions in that method even if the rest of your code uses them). C99 borrowed a lot of its nice features from C++ (it's a shame that C++ took so long to take "null"

                Just because C++ lets you quickly write inefficient code (like copy-by-value parameter passing for superlarge types), it doesn't mean that C++ is itself less efficient, just that some people don't know as much about programming as they think they do. (The small consolation of such dumb behaviour is a. it's less likely to cause bugs than naive use of pointers is, and b. you can optimise the problem away later.)

                I'm happy to accept the argument that C++ leads developers to use less speed-efficient data-structures like the STL containers for tasks where a hand-rolled equivalent would be superior, and that C++ code can thus be slower as a result. But that's trading dev-time for run-time. Unlike a hand-rolled data structure, the STL version will work reliably straight away; and in general, I heed Knuth's warnings about premature optimisation...

                1. jake Silver badge

                  @ Kristian Walsh (was: Re: object orientated)

                  "C++ code runs exactly as fast as the equivalent C code"

                  Uh ... no. Show me real-time code that is written in C++.

                  1. Kristian Walsh Silver badge

                    Re: @ Kristian Walsh (was: object orientated)

                    "Uh ... no. Show me real-time code that is written in C++."

                    Uh ... yes:

                    http://en.wikipedia.org/wiki/Symbian

                    Hey, here's two more that you can see the code for:

                    http://scmrtos.sourceforge.net/ScmRTOS

                    http://miosix.org/index.html

                    Whether a kernel is C or C++ depends more on when it was started than any other factor. C++ is a superset of C; anything that needs C for "efficiency" is just as possible with C++ code.

                    1. jake Silver badge

                      Re: @ Kristian Walsh (was: object orientated)

                      Symbian? There's a fail. And it was/is all K&R C.

                      scmRTOS? That's all K&R C.

                      miosix? "supports" C++ ... Straight C otherwise.

                      C++ is indeed a superset of C. That doesn't mean that everything compiled with a "C++" compiler is actually written in C++.

                      1. Kristian Walsh Silver badge

                        Re: @ Kristian Walsh (was: object orientated)

                        Ha! Does your voice get muffled when you sit down, jake?

                        We mere mortals don't have your custom build of K&R that can compile namespaces, the 'this' keyword, variable instantiation within sub-scopes, default-value initialisation, function calls using the dot and pointer operators, and templates.

                        1. jake Silver badge

                          Re: @ Kristian Walsh (was: object orientated)

                          No, Kristian, my ears are quite clear of bogus marketing bullshit.

                          But then I code close to silicon, not close to marketing memes.

                          For me, GCC spits out assembler. I hand massage it. My customers are happy.

                          1. Kristian Walsh Silver badge

                            Re: @ Kristian Walsh (was: object orientated)

                            You can't tell the difference between C++ and C source-code, though. Hardly a good starting point if you're going to pronounce on the advantages of one over the other.

                            1. jake Silver badge

                              Re: @ Kristian Walsh (was: object orientated)

                              Again, not everything compiled with a "C++" compiler is actually written in C++.

                              HTH, HAND.

                              1. Kristian Walsh Silver badge

                                Re: @ Kristian Walsh (was: object orientated)

                                You haven't even looked at the code, have you? There is no C compiler in existence that can compile the projects I cited. That is because they are written in C++.

                                For the record, there is also a difference in the output of "C code compiled with a C++ compiler", and "C code compiled with a C compiler". If something is written in C, we will use a C compiler to compile it, because that preserves the other assumptions about C code (particularly symbol naming, but there are other, more subtle differences).

          2. jake Silver badge

            Re: @Dan 55 (was: C++ ...)

            "The remainder are C, or Objective-C for less performance-critical ones."

            So the "performance-critical ones" are written in C? Seems to negate the entire rest of your post. Think about it.

            "As it is, Cupertino's current device driver and I/O layer is written in C++, and so are many of the low-level libraries unique to OS X."

            And the bugs creep in where, exactly? It ain't in the kernel ...

            "But, going back to kernels: The reason why the BSD kernel is written in C is because AT&T's UNIX kernel was written in C, and that was because C was the language that K&R developed specifically to allow their UNIX OS to be portable across AT&T's various system architectures."

            What you are forgetting (or ignoring) is that nobody has invented anything better than K&R for cross-platform kernel development. It's not inertia, it's reality.

        2. Dan 55 Silver badge
          Stop

          Re: @Dan 55 (was: C++ ...)

          All code should be fast and, especially these days, secure. If the code doesn't have to hit the hardware then it is in all probability easier to write and maintain in C++ than C.

          1. jake Silver badge

            @Dan 55 (was: Re: @Dan 55 (was: C++ ...))

            "If the code doesn't have to hit the hardware"

            WTF? Code doesn't run in a vacuum, code tells hardware what to do.

    2. Anonymous Coward

      Re: C++ ...

      Err.. NO

      You can recognize a personality in any language. It takes me a few split seconds to look at a piece of the Linux kernel code and say Al Viro, Theodore Ts'o or "Not Alan Cox again..., time to look for an obscure logic error somewhere". That is C for you as an example.

      Similarly, I can recognize in a split second the style of various people I have worked with in python, perl, java, etc. Even projects that have vicious style requirements (kvm/qemu) still show the distinctive style of key contributors, making the author instantly recognizable.

      What is more interesting is how this handles the evolution of a person's coding technique over time and over project changes. For example, my code from before I worked on kvm/qemu for a while and my code from after look and read as if written by different people.

      Anonymous... Just for the fun of "recognize me programmatically" :) By el-reg posting style...

      1. yoganmahew

        Re: C++ ...

        "Err.. NO

        You can recognize a personality in any language."

        Likewise in mainframe assembler. I can recognise fellow coders by the instructions they use (versus alternates), the way they structure their logic. The most telling thing, though, is the 'shape' of the code and the commenting style - verbose or no comments, instructions and comments neat or higgledy-piggledy.

        It does help, though, that we use initials in comment tags, so 40 years of modifications to a program can be laid bare...

      2. jake Silver badge

        @AC "21 hrs" (whatever that means, ElReg) Re: C++ ...

        "You can recognize a personality in any language."

        OK, I'll bite.

        "It takes me a few split seconds to look at a piece of the Linux kernel code"

        Find me. I've been contributing to the Linux kernel for over two decades.

  5. Torben Mogensen

    Time versus code length

    That competitors who complete more tasks in code competitions have, on average, longer programs than those who complete fewer tasks is not surprising. Mark Twain is often credited with ending a letter with "I apologize for the length of this letter. If I had had more time, it would have been shorter". The same is true for programming: it takes more time to write shorter code. It is often faster to cut-and-paste and do local modifications than to make a parameterised procedure to cover all cases, and sometimes it is faster to special-case on different inputs than to make a general solution, which often requires insights that take too long to obtain when you are pressed for time. And you certainly don't want to spend time on simplifying code that works. Good competition programmers also often have a standard skeleton program that they modify for each task, because it is faster than starting from scratch. So there will often be procedures that the programmers do not bother to remove even if they don't use them. They do no harm, so why spend time removing them?

    Coding competitions are very different from normal programming: The problems are small and self-contained, so you don't have to worry about modularisation or readability of the code (in a few hours, nobody will ever look at the code again), and the process is more explorative than normal coding. So you can't draw conclusions about general coding style from such competitions.

  6. Torben Mogensen

    Obfuscation

    Most code obfuscation is done at the lexical level: whitespace and comments are eliminated, variables and procedures are renamed, macros are expanded, and so on. As mentioned in the article, such tools cannot hide coding style, as this goes far beyond lexical details. So a good obfuscation tool must work on the semantic level of the program: It must replace code with semantically equivalent code using more than just local syntactic or lexical transformations. This is very difficult to do, especially if the language semantics is loosely specified (*cough* C *cough*). Writing such a tool is (at least) as complicated as writing a compiler, which is why it is rarely done. But there is research that points the way: http://dl.acm.org/citation.cfm?id=2103761
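To make the lexical-versus-structural distinction concrete, here is a small sketch. It uses Python's standard `ast` module on Python source purely for convenience (the thread is about C and C++, so that is an assumption of the example), and the helper `shape` is a hypothetical name. It blanks out every identifier and then compares what is left of the syntax tree.

```python
import ast


class _Anonymise(ast.NodeTransformer):
    """Replace every identifier with a placeholder, keeping the tree shape."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def shape(source: str) -> str:
    """Return a dump of the syntax tree with all names erased."""
    tree = _Anonymise().visit(ast.parse(source))
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return ast.dump(tree)


original = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
# Lexical obfuscation: names shortened, whitespace squeezed.
renamed = "def a(b):\n c=0\n for d in b: c+=d\n return c\n"
# A semantically equivalent rewrite with a different structure.
rewritten = "def total(values):\n    return sum(values)\n"

print(shape(original) == shape(renamed))    # same tree once names are erased
print(shape(original) == shape(rewritten))  # different tree
```

Renaming and reformatting leave the tree untouched, which is why lexical tools do not hide the author's structural habits; only the third version, which changes how the computation is expressed, alters the shape.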

    1. Pen-y-gors

      Re: Obfuscation

      The article suggests that obfuscation tools can't beat their analysis, but as one of the things their analysis uses is lexical information, then presumably obfuscation at least makes it harder to get a match, although still not impossible - presumably it would need larger samples to work on? If it works perfectly well without lexical information, then why waste time using lexical information in the first place?

    2. Anonymous Coward

      Re: Obfuscation

      Is it really that hard, assuming you don't mind discarding a few obscure optimisation opportunities? LLVM already has a C back-end.

      The paper you pointed at makes a program "harder to understand or analyze". But that's presumably not required for merely disguising the author.

    3. Anonymous Coward

      Re: Obfuscation

      We tried commercial semantic obfuscators. The result passed all our regression tests but really pooched our benchmarks -- we sold scientific libraries that were *very* time critical -- so we stuck with lexical obfuscators for source sales.

      Now I am intrigued enough to actually read the original paper.

    4. LucreLout

      Re: Obfuscation

      "Most code obfuscation is done at the lexical level"

      Most of the obfuscation in the code I'm looking at just now seems to have been done by someone who has no business with a compiler. Any metric that determines this lengthy, monolithic spaghetti code is more productive than a properly architected version with a tenth of the code, is a metric that discounts the software life cycle in favour of only initial coding time.

      12 projects full of 2000+ line classes, everything is concrete, nothing is abstract, there are no interfaces, no patterns... I'm not sure an obfuscator could make things any more difficult to work with. It's just awful.

  7. Anonymous Coward

    Satoshi Nakamoto unmasked yet?

    1. Anonymous Coward

      Satoshi Nakamoto == Raymond Chen

      Has to be true; today's Microsoft can't be giving him enough to do.

  8. DropBear

    Hmmm....

    I wonder how easily this tool would identify someone who knows he might be profiled and consciously tries to stay away from some of his own known habits. Obviously, trying to do this many different times would be a short route to the nuthouse but perhaps it would work for one or two specific known-dangerous things to contribute to, as a departure from one's "normal" coding style. You know, start using else-ifs instead of switches, suddenly pick up a preference for Hungarian notation, pass everything through GNU indent at default settings, that sort of thing...

    1. Will Godfrey Silver badge

      Re: Hmmm....

      Also, what about the case where someone is maintaining code and has (sensibly) decided to stick to the original coder's style?

      1. John H Woods Silver badge

        Re: Hmmm....

        "Also, what about the case where someone is maintaining code and has (sensibly) decided to stick to the original coder's style?"

        -- or copied and pasted a good example from the net?

    2. Anonymous Coward

      Re: Hmmm....

      'consciously tries to stay away from some of his own known habits'

      Like, say, someone with a job who decides to freelance by working on some custom malware targeting their employer, customers, whatever?

  9. dogged

    Bullshit

    The result on about 80% of decent .NET code these days would be "clears Green on Resharper" which hardly identifies anyone.

    It's all about the tools...

  10. sisk

    Not Surprised

    It doesn't take long when working on a team project where you're seeing code from other people to learn to recognize which code came from whom. It's kinda neat that they can teach a computer to do it, but I've been in plenty of situations that demonstrate the concept before. You know it's going to be a long day of bug hunting when you recognize a particularly bad programmer's style.

    1. Alistair
      Linux

      Re: Not Surprised

      "You know it's going to be a long day of bug hunting when you recognize a particularly bad programmer's style"

      Absolutely agree. Been with my current employer for 17 years. There are a few coders (shell, java, app config, COBOL (yes it's still here) and JCL, C, C++, Python, Perl and (ick) Ruby) that I know by sight. Some have left and I still know "who wrote that". There was a project recently that had been "dropped on the ground" by management shuffles; at some point it became a critical issue and, well, from a comment style in an apache virtual host config I knew who had set it up... and don't get me started on the tomcat application config -- *that* told me what team had put it in...

      (I might be a linux nut, but yes, I still look at the mainframe once in a blue moon too)

  11. Dr Paul Taylor

    students

    "its techniques could be used to identify plagiarism among computer science students"

    I found that laying one printout next to the other was an adequate technique!

    Though, it is true that the spaces and tabs were a giveaway, when the indentation was, shall we say, merely decorative.

    1. Crazy Operations Guy

      Re: students

      Though I'd imagine you'd get a lot of false positives: many students pick up coding styles very similar to their teachers', and since they are all trying to solve the same, usually trivial, problem, a lot of their submissions would look very similar to one another.

      1. John Gamble

        Re: students

        It can go in the other direction too. My coding style certainly changed over the semester as I learned what worked for me and what didn't. Begin/end and brace placement changed heavily over the semester, as did my indentation width.

        It happens as one learns a new language as well. I could definitely look at an early program in a new-to-me language and tell what other language was influencing my style at the time.

    2. the spectacularly refined chap

      Re: students

      "I found that laying one printout next to the other was an adequate technique!

      Though, it is true that the spaces and tabs were a giveaway, when the indentation was, shall we say, merely decorative."

      Those things are easily altered. I can't help but think back to my own time at Uni - the University of Manchester - where all code went into John Latham's ARCADE system that detected any plagiarism. This is over 20 years old now.

      He did explain how it worked a couple of times and although he never used those terms it seemed to perform a lexical analysis first and then consider the resulting token stream. Comments, white space, variable names etc were thrown out straight away as trivially easy to alter. Instead it simply looked at a sequence "identifier, multiply, constant, terminator..." that is much more difficult to alter in a non-trivial manner since it is intrinsically linked to how the program works.
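The token-stream idea described above can be sketched in a few lines. This is not ARCADE's algorithm, whose details are not given here; it is a minimal illustration that assumes Python source and uses the standard `tokenize` and `difflib` modules. The helpers `token_kinds` and `similarity` are hypothetical names. Comments and blank lines are dropped, every identifier collapses to `ID`, and only the sequence of token kinds is compared.

```python
import difflib
import io
import keyword
import tokenize

# Token types that are trivially easy to alter and so are thrown away.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENCODING, tokenize.ENDMARKER}


def token_kinds(source: str) -> list:
    """Reduce source to a sequence like ['ID', '=', 'ID', '*', 'NUM', ...]."""
    kinds = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME:
            # Keywords carry structure; ordinary names are interchangeable.
            kinds.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            kinds.append("NUM")
        elif tok.type == tokenize.STRING:
            kinds.append("STR")
        elif tok.type == tokenize.OP:
            kinds.append(tok.string)
        else:
            # NEWLINE, INDENT, DEDENT: statement ends and block structure.
            kinds.append(tokenize.tok_name[tok.type])
    return kinds


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely the two token-kind sequences match."""
    matcher = difflib.SequenceMatcher(
        None, token_kinds(a), token_kinds(b), autojunk=False
    )
    return matcher.ratio()


submitted = "# my own work\narea = width * 2\nprint(area)\n"
copied = "a   =   w * 2   # totally different\nprint( a )\n"
different = "for x in range(3):\n    print(x + 1)\n"

print(similarity(submitted, copied))
print(similarity(submitted, different))
```

Renaming variables, rewriting comments and changing spacing leave the first score at a perfect match, while a program that actually works differently scores lower.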

  12. captain veg Silver badge

    copy, paste

    "Programmers who are more advanced and are able to solve more difficult tasks have more distinct coding styles than programmers that are not as advanced."

    Because the latter group simply copy and paste snippets of code written by the former?

    -A.

  13. Snowman

    Interesting since any time I look at code I wrote years ago I tend to find the parts of style I have shifted on are the most frustrating; // except for the few parts that for some reason went without comments, where it would now be useful to have them

  14. Anonymous Coward

    "Programmers with a larger skill set can be identified more easily and with higher confidence. "

    As the most incompetent, unproductive programmer in the whole history of computing, I guess my anonymity is quite safe, thank Stallman.

    1. proto-robbie
      Facepalm

      Re: "Programmers with a larger skill set can be identified more easily and with higher confidence. "

      You're not, unless you're the bloke whose dross I inherited (brain the size of a planet, etc...).
