Run a JSON file through multiple parsers and you'll get different results every time

The ubiquitous message-passing JSON format is something of an untended garden with plenty of security and stability traps for the unwary. That warning comes from software engineer Nicholas Seriot, who last week presented his work on JSON parsers to an audience at Geneva's Soft-Shake Conference. The problems arise because …

  1. Andrew Commons

    Welcome to the Internet

    Tools such as Nmap rely on implementation differences to fingerprint end points. These implementation differences are invariably fuelled by sloppy specifications - aka RFCs - that use the terminology of RFC 2119 (and all too frequently RFC 6919) to specify the technology we rely on.

    These should be reduced to MUST and MUST NOT before things get any better and even that is probably not going to be sufficient.

    I assume tools like nmap will jump on this :-)

    1. Anonymous Coward
      Anonymous Coward

      Re: Welcome to the Internet

      I've been horrified by the laxness of quite a lot of the RFCs I've read. As a means of unambiguously defining a standard they've got a few problems.

      Standards are boring, tedious, and dull, or at least so the younger coders think. The thing is, as they spend the earlier part of their careers dealing with the mess they've often created for want of a proper, well-implemented standard, they slowly become convinced of their worth.

      JSON is just another example of something embraced by "hot young radical" coders who are slowly learning that it is indeed hot, though only in the same sense as a steaming pile of composting shit. JSON is a lazy alternative to Doing The Job Properly, picked up by youngsters too lazy to imagine that there's a vast history of other programmers who have been there, seen it, bought the T shirt and fixed it properly A Long Time Ago.

      In principle JSON schemas are OK. They're one of the few schema languages that let you set size and value constraints on message fields, just like XSD and ASN.1. For reference, GPB, Thrift, and loads of others do not. Handling such constraints is the most useful thing that anything like this can do; without it you're back in the dark ages of Writing It Down And Expecting Some Dumb-Ass Coder To Read It (Yeah, Right). Unfortunately, as Seriot shows, the tooling and standards (never mind standards compliance) around JSON and JSON schemas are diabolically bad.
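
      To make the point concrete, here's a minimal sketch of the sort of constraint I mean, using Python and the third-party jsonschema package (the field names are invented for the example):

          import jsonschema  # pip install jsonschema

          # Size and value constraints on message fields, roughly what XSD or
          # ASN.1 subtype constraints give you.
          schema = {
              "type": "object",
              "required": ["name", "age"],
              "properties": {
                  "name": {"type": "string", "minLength": 1, "maxLength": 64},
                  "age": {"type": "integer", "minimum": 0, "maximum": 150},
              },
              "additionalProperties": False,
          }

          jsonschema.validate({"name": "Alice", "age": 42}, schema)  # passes
          jsonschema.validate({"name": "", "age": 9000}, schema)     # raises ValidationError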

      1. arthoss

        Re: Welcome to the Internet

        Thanks that was interesting. What do you think of JSON with ODATA? That would cover it the same way as ASN.1, no?

        1. Anonymous Coward
          Anonymous Coward

          Re: Welcome to the Internet

          @arthoss,

          "Thanks that was interesting. What do you think of JSON with ODATA? That would cover it the same way as ASN.1, no?"

          Possibly, I've not used it. But if the JSON part is ambiguous then it's less good than something unambiguous.

      2. allthecoolshortnamesweretaken

        Re: Welcome to the Internet

        As a species, we're kinda slow learners, aren't we?

        1. Anonymous Coward
          Anonymous Coward

          Re: Welcome to the Internet

          That's because it's only individuals (some individuals, some of the time) who can learn. The species has the means to "learn" in some senses - that is what Korzybski defined as the ability to "time-bind", which he considered unique to humans. But, both in individuals and for the species, learning is strictly optional.

          We are much too short-lived to be able to derive much benefit from learning. "Ars longa, vita brevis". By the time a typical person has learned enough to have some kind of a grip on reality, he is so old that he no longer feels much desire to do so; and of course the younger generation feel free to dismiss him as an old fool living in the past. Just look at the way people like Tony Blair castigate the Victorians as hopelessly old-fashioned - practically medieval. If Blair had learned anything from his history books and lectures, he would understand that virtually every advanced technology we enjoy today was pioneered and developed in the Victorian era.

          The challenge of learning as a species, while very difficult, is just part of the even greater challenge of governing ourselves. Homo sapiens evolved living as small tribes of roaming hunter-gatherers, and so our instincts all clamour for a strong, capable leader. Given such a leader, a group of fewer than 200 or so - say a military company - is eminently manageable. Everyone knows everyone else, and understands their strengths and weaknesses. Above all, everyone knows who is trustworthy and who is a cheat or a liar.

          As soon as communities grow to thousands, millions, and billions, all our social instincts lose most of their value. We elect leaders who sound like strong, honest, capable people but who are actually little more than good orators and wheeler-dealers. While they are indulging their vast egos and feathering their nests, the rest of us are left to pursue our own interests - mostly under the guidance of the "profit motive", as described by modern economics' image of "homo economicus". Because such models are wholly alien to our natural instincts, the result is that most of us are stressed and unhappy, and a tiny handful accumulate vast wealth that they can never spend, while most of the rest remain unfulfilled, poor and miserable.

          The criterion of whether homo sapiens is truly intelligent is whether we can get our affairs in order as a species. So far, it's not looking at all good.

      3. Warm Braw

        Re: Welcome to the Internet

        To be fair, JSON is in part an answer to the problem that ASN.1 is (the clue is in the name) an Abstract specification. After that you have to pick an "encoding rule" and, while these are rather better defined than JSON, they're not without their problems: there can be different ways of encoding the same thing (potentially awkward if you're trying to do digital signatures) and encodings that can't encode all possible data (GSER - which is why it's not widely used). The binary encodings are prone to programming errors if hand-crafted, but tried-and-tested libraries can be a fairly heavyweight addition to a simple project and are not immune to bugs even now.

        Having said that, there does seem to be an irrational prejudice in the Unix-derived world to exchanging data in any form other than western-culturally-specific string formats which don't seem to be either time- or space-efficient. And to anything arising from ISO/IEC/ITU. So a complex standard with several different options for encoding the actual data is going to struggle, even if it has burrowed into some of the further corners of the Internet (e.g. SNMP and the widespread use of X.509 certificates).

        1. Anonymous Coward
          Anonymous Coward

          Re: Welcome to the Internet

          @Warm Braw,

          "To be fair, JSON is in part an answer to the problem that ASN.1 is (the clue is in the name) an Abstract specification."

          Well, I'm not sure that the term abstract in this sense is a real problem. JSON, ASN.1 and XML can all represent any data in a programming-language-neutral way. The schema languages follow a similar philosophy (i.e. allowing one to say everything there is to know about the data that will be passed). The only real difference in outcome is that JSON and associated tools are not as well pinned down as, say, ASN.1 or XSD.

          "After that you have to pick an "encoding rule" and, while these are rather better defined than JSON, they're not without their problems: there can be different ways of encoding the same thing (potentially awkward if you're trying to do digital signatures) and encodings that can't encode all possible data (GSER - which is why it's not widely used)."

          Personally speaking I find the wide variety of encoding rules available under ASN.1 very liberating; there's one for every occasion. Mixing and matching between them can be a pain in the arse, and understanding exactly which variant is in use is indeed problematical. But at least they're rigidly defined. GSER sounds, well, unpleasant...

          "The binary encodings are prone to programming errors if hand-crafted, but tried-and-tested libraries can be a fairly heavyweight addition to a simple project and are not immune to bugs even now."

          Agreed, handcrafting ASN.1 serialisation code is definitely a mug's game. I've found that a couple of the commercial toolchains / libraries are pretty good, and have for me at least been excellent value for money.

          I've had interesting discussions with the vendors concerning the "heavyweightedness" of their tools, particularly C++. These days one would imagine that they'd be leveraging all the nice containers provided by the STL to give a pleasant programming interface to things like lists, sequences, etc. but there's a strong resistance to doing so. Part of the issue is that they've all built up non-STL C++ implementations which, conveniently, they can support on platforms of lighter weight (e.g. C++ on microcontrollers where STL is a rare beast).

          And yes, there can be bugs. I've come across some myself. But in theory every bug identified and fixed results in a whole lot of code everywhere becoming better with minimal effort on the part of the system developer. They just get an updated toolchain, rebuild, et voilà. Certainly it's better than having to fix numerous pieces of handcrafted code every time an interface is found to be crufty.

          "Having said that, there does seem to be an irrational prejudice in the Unix-derived world to exchanging data in any form other than western-culturally-specific string formats which don't seem to be either time- or space-efficient."

          It's definitely a UNIX thing. Example: the howls of rage over SystemD not storing logs as human-readable text, requiring a journal control programme to access log data. Part of the Unix world's preference for text is "ease of debugging", but I think that's a false economy. Having exchanges easily read by a human is all very well and good, but if you end up having to do that a lot because the interface spec is poorly written and poorly implemented then it's a waste of time. Far better indeed to go for a rigid interface definition system of any sort and avoid having to spend endless hours poring over text files / streams, etc.

          And for tools like ASN.1 there are some excellent debugging capabilities anyway. With the commercial tools you can get some excellent data viewers which will interpret data and show you what it contains. The online ASN.1 Playground is a useful freebie for doing this. Wireshark also includes an ASN.1 parser - capture a TCP stream, import the ASN.1 schema, see what data is being exchanged. Wireshark can also do the same for Google Protocol Buffers.

          So I think in this day and age the idea that things have to be human-readable to be debuggable is well and truly debunked.

          "And to anything arising from ISO/IEC/ITU. So a complex standard with at least different options for encoding the actual data is going to struggle, even if it has burrowed into some of the further corners of the Internet (e.g. SNMP and the widespread use of X.509 certificates)."

          Perhaps. Let's not forget LDAP too, something that no one has been able to displace. Some of these things have come out of X.500 rather than out of Unix, which may explain their nature.

      4. Andy 73 Silver badge

        Re: Welcome to the Internet

        As a coder who long ago stopped being 'young' (though I am still hot thank you very much), I'm wary of those developers who bang on about Doing the Job Properly, when it translates to picking up an overblown spec and spending three months implementing it to the letter just to store trivial data. They're the same guys who foisted XML on us and who drive us to use obscure libraries because the 'proper' solution is only used by a handful of people. Some would consider this to be an offshoot of MDD.

        As it is, I note that for the Java JSON libraries, which are mature and well supported, the worst crime is a failure to parse, caused by deliberately badly-formed documents. On the whole I would not plan to use JSON to read any document that I hadn't created myself, and try to avoid exposing end users to such things.

        JSON was a reaction to the heavyweight formats that flourished in the 90s, and is useful in exactly the situations where they were not. Perhaps the older coders can remember that too?

        1. Buzzword

          Re: not parse JSON documents that I hadn't created myself

          > I would not plan to use JSON to read any document that I hadn't created myself

          But if you have a public-facing website, anyone can POST a JSON document at your endpoints and potentially crash your server. How bad that actually is depends on how robust the rest of the system is at handling crashes.

          Worse would be a situation where a single JSON document gets parsed by two different engines. For example the JSON parses correctly in the bank's deposit-into-my-current-account function, but throws an exception in the corresponding deduct-from-my-savings-account function.

          (Unlikely yes, but there are other less serious examples which could still cause trouble.)

          1. isogen74

            Re: not parse JSON documents that I hadn't created myself

            If you have a public-facing website people can POST all sorts of shite at it, JSON or otherwise, whether you like it or not. Validating your inputs from untrusted sources still applies (there are good JSON parsers out there which can handle invalid inputs safely).

            I'm not sure JSON is any worse in this regard than any other form of data upload; it's just an arbitrary string after all ...

            1. Andrew Commons

              Re:it's just an arbitrary string after all...

              More likely to be a very carefully chosen string, particularly when the parser has been identified and its parsing quirks are known.

              Quite a large number of the parsers tested supposedly parsed input they should have rejected. That would be an interesting path to explore if you wanted to inject invalid data into an application.

            2. Andy 73 Silver badge

              Re: not parse JSON documents that I hadn't created myself

              Indeed, it's true that if you have a public website that uses public endpoints that have to handle JSON, you have to guard against invalid inputs. The same would apply to any data format. Note that the Java libraries all passed the testing outlined here without causing crashes or incorrectly parsing valid inputs.

              The point being that none of this is a reason to jump to some heavyweight and overblown interchange format 'because proper'.

            3. Anonymous Coward
              Anonymous Coward

              Re: not parse JSON documents that I hadn't created myself

              It's exactly the parser that should validate inputs - and raise proper exceptions/errors (hoping the programmer will handle them properly and not hide them under the carpet...) - and not crash. The point of the study is that there are too many parsers that can't handle invalid inputs properly, and the developer may not be aware of that.

              The problem with JSON is exactly that it is an arbitrary string. There's no reason non-string data should be transferred as strings. Even different endianness can be coped with - TCP/IP works exactly that way without encoding data as strings.

            4. Roland6 Silver badge

              Re: not parse JSON documents that I hadn't created myself

              re: " Validating your inputs from untrusted sources still applies"

              And this is the real problem with JSON as exposed by Nicholas Seriot. I cannot reliably use or implement a JSON firewall, because it is almost certain to interpret what is valid JSON differently from whatever system produced the JSON file, and from whatever systems I may be forwarding that file to, now and in the future. Thus effectively what Nicholas's tests show is that JSON is presently not fit for its intended purpose as a general-purpose data-interchange format.
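
              As one concrete illustration of the sort of divergence I mean (a sketch in Python; RFC 7159 leaves the handling of duplicate names to the implementation):

                  import json

                  doc = '{"amount": 1, "amount": 1000000}'

                  # Python's standard parser silently keeps the last value...
                  print(json.loads(doc))   # {'amount': 1000000}

                  # ...other parsers may keep the first value, or reject the document
                  # outright. A "firewall" can at best enforce its own reading, e.g.:
                  def reject_duplicates(pairs):
                      keys = [k for k, _ in pairs]
                      if len(keys) != len(set(keys)):
                          raise ValueError("duplicate keys: %r" % keys)
                      return dict(pairs)

                  json.loads(doc, object_pairs_hook=reject_duplicates)  # raises ValueError

              But it still cannot know which reading the systems downstream of it will apply.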

          2. Jason Bloomberg Silver badge

            Re: not parse JSON documents that I hadn't created myself

            But if you have a public-facing website, anyone can POST a JSON document at your endpoints and potentially crash your server.

            To me that's the fault of the JSON parser or whatever causes the crash; not a problem with JSON itself.

            It's the same as creating a GIF which blows open some image viewer; it's the viewer at fault, not GIF itself.

            1. Anonymous Coward
              Anonymous Coward

              Re: not parse JSON documents that I hadn't created myself

              Not if the specification is so lax and ambiguous that it becomes very difficult to write a proper parser. Of course not every parser error may be due to the JSON specifications, but some could be.

          3. Anonymous Coward
            Anonymous Coward

            Re: not parse JSON documents that I hadn't created myself

            "Worse would be a situation where a single JSON document gets parsed by two different engines."

            Crap software is crap software regardless of implementation. An input document should never be re-presented at different stages of a staged process. The data should be extracted once, sanity checked, then verified against whatever tests of integrity are needed, and then go through a pipeline if multiple processing stages are involved. Anything else is failure by design.

          4. Number6
            Happy

            Re: not parse JSON documents that I hadn't created myself

            ... a single JSON document gets parsed by two different engines. For example the JSON parses correctly in the bank's deposit-into-my-current-account function, but throws an exception in the corresponding deduct-from-my-savings-account function

            Can I have a copy of that document please? I might have a use for it...

        2. Anonymous Coward
          Anonymous Coward

          Re: Welcome to the Internet

          > I would not plan to use JSON to read any document that I hadn't created myself

          Isn't that more or less equivalent to dismissing it as of no real value?

        3. AdamWill

          Re: Welcome to the Internet

          Exactly this. I mostly work in Python and came to the same conclusion: the worst failure for both Python 2 and Python 3 parsers was a failure to parse (not surprisingly, unicode shenanigans - probably not even specifically to do with JSON parsing), and this is only a problem if you're parsing untrusted input (or trusted input which might include one of the problematic values). Which I'm not. JSON's perfectly fine if you just want a quick, simple way to serialize data in a pretty well-known format. Fr'instance, I wrote a few trivial lines the other week to have a script which fires up when a certain event happens check if the same event has happened before (to a reasonable limit of previous events it cares about), and bail out in that case. I had it store the list of the last few known events as JSON, because it's right there in the standard lib and using it is like two lines of code. It would be absurd to drag pyasn1 into the code (which fits on one page) just to store a small list of strings, which originate from a trusted system and which I know aren't going to include anything but ASCII characters.
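
          (For the curious, a sketch of roughly what that script does - the file path and event handling here are invented for the example, Python 3 assumed:)

              import json, os

              SEEN_FILE = "/tmp/seen_events.json"   # made-up path
              MAX_EVENTS = 20                       # limit of previous events we care about

              def already_seen(event):
                  """Return True if we've handled this event before; otherwise record it."""
                  seen = []
                  if os.path.exists(SEEN_FILE):
                      with open(SEEN_FILE, encoding="utf-8") as f:
                          seen = json.load(f)
                  if event in seen:
                      return True
                  seen = (seen + [event])[-MAX_EVENTS:]
                  with open(SEEN_FILE, "w", encoding="utf-8") as f:
                      json.dump(seen, f)
                  return False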

        4. Anonymous Coward
          Anonymous Coward

          Re: Welcome to the Internet

          @Andy 73

          "As a coder who long ago stopped being 'young' (though I am still hot thank you very much)..."

          I'll take your word for it!

          "...I'm wary of those developers who bang on about Doing the Job Properly, when it translates to picking up an overblown spec and spending three months implementing it to the letter just to store trivial data"

          Agreed. I'm all for picking a set of tools that does the job easily with the minimum of fuss. I'll even pay for tools. The ASN.1 standards may well be lengthy, but there's nothing lengthy about a good ASN.1 schema, or using good ASN.1 tools, or the wire format data, or integrating bits of the system together even if one's gone and used different languages on different platforms for different subsystems. The same goes for Google Protocol Buffers (though that doesn't support size and value constraints) and a few others, but as the article reports this isn't universally true of JSON tools.

          The thing about tools that support ITU standards is that there's a strong expectation that the tools actually correctly implement the standards. If the tools vendor gets it wrong it means that, for example, a modem chip won't be allowed to be built into mobile phones because it'll screw up the networks. Bad karma for the tools customer. Bad karma for the tools vendor. Not a situation anyone involved wants.

          I wish that Google would add constraints checking to their Protocol Buffers. If they did that it'd be a very useful piece of kit. Simple schema language, support for a useful mix of languages, compilable across many different platforms, binary wire format and open source? Yes please. It'd be like ASN.1 cut down to its most useful essence.

        5. JLV

          >Perhaps the older coders can remember that too?

          Exactly. XML started out pretty well, but then got mired in the XSD vs DTD stuff and complexified itself by the time it got to S(imple, hah!)OAP. How many folks wrote actual XSDs rather than booting on it?

          Before banging on about how horrible and sloppy everyone in JSON-land is, why not push the same type of crap data into an XML parser? Wanna bet those will never crash? What about genius old-school parsers for "quality" text formats written by super-clever old programmers (i.e. my age, just the clever kind) based on back-in-my-days specs? Never crash on bad data, right?

          Hopefully this will serve as a wake-up call that a) the specs and implementations might need polishing up and b) external sources of JSON shouldn't be trusted overmuch. I wonder if just the relevant people agreeing on a common set of test files, along with agreed-upon expectations, and making those available to the implementations, wouldn't evolve things quite a bit.

          Thanks a heap to Nicholas Seriot for pointing this problem out.

          p.s. I wonder what would happen if you shoved these strings into database parsers that support JSON data types? Like for example PostgreSQL's native JSON. What does a hypothetical parser crash take out? Can you do that through native db-binding (i.e. not just because someone left themselves vulnerable to sql injection)?

      5. Anonymous Coward
        Anonymous Coward

        Re: Welcome to the Internet

        > JSON is just another example of something embraced by "hot young radical" coders

        Anon, I call bollocks on your fairly idiotic and condescending post.

        As it happens, I have experience with SOL (safety of life) systems, and we have used or been forced to use a variety of formats and approaches over the years. Invariably, the ones with overblown specs or relying on "strict" requirements have consistently been shown to be the worst in terms of cost, performance and, more critically, safety.

        The exceptions where such strict and detailed processes can be justified, afforded, and are actually successful are very few (e.g., the NAVSTAR project was one such example).

        Don't know who it was that actually said "keep it simple", but he, she, or it was right.

        Unless you're a consultant, of the kind who just want to milk their clients without providing real value. In that case, overblown specs are par for the course.

      6. Anonymous Coward
        Anonymous Coward

        Re: Welcome to the Internet

        >Standards are boring, tedious, and dull, or at least so the younger coders think.

        Sadly, in my experience standards bodies are often very political between different companies (i.e. a guaranteed mish-mash of crap, ideal for no one). Not to mention that the type of person who usually ends up on a standards body is the guy who has been at his current place forever, thinks he is important, and has built up political connections, but is usually a pretty mediocre engineer-manager wannabe put on the standards body to keep him away from mucking up the day-to-day work. Get several dozen of these types together and it's hardly a surprise that the output sucks.

      7. Frumious Bandersnatch

        Re: "the job done properly" (with link to page about ASN.1)

        As I recall, ASN.1 parsers have also had exploitable bugs in them.

        I much prefer YAML over both of those but YMMV.

        1. Anonymous Coward
          Anonymous Coward

          Re: "the job done properly" (with link to page about ASN.1)

          @Frumious Bandersnatch,

          "As I recall, ASN.1 parsers have also had exploitable bugs in them. I much prefer YAML over both of those but YMMV."

          Indeed. I've not yet used YAML myself, but some colleagues have done so with good success.

          Another one that looks interesting is Cap'n Proto. This doesn't have constraints setting / checking in it (boo!), but it looks like it could be lightning fast.

      8. dajames

        Chalk and cheese

        JSON is a lazy alternative to Doing The Job Properly (link to Wikipedia article on ASN.1)

        JSON is a data representation format, ASN.1 is a language for describing such formats.

        There are a number of data representation formats specified for use with ASN.1 (ASN.1 calls them "encodings") and while most people using ASN.1 think naturally of DER there are other encodings, such as XER, which uses XML, and there is no good reason not to have a JSON encoding for ASN.1 -- JER anyone?

        Of course, if you tie your use of JSON to a strict schema - which could be specified in ASN.1 - most of the problems discussed in this article cease to be an issue.

    2. AndrueC Silver badge
      Meh

      Re: Welcome to the Internet

      There are worse protocols to work with. I've spent the last three months implementing support for HL7 (2.x - thankfully no-one seems to want the XML variant). There's no shortage of documentation about that. Vast reams of stuff. Pretty much every last aspect of it is tied down. But there's the problem. You go into that much detail and you're starting to come up with a protocol that defines the domain. The tail wagging the dog.

      And you know what - no-one is exactly following the standard anyway. They probably all got sick of reading stuff like this. So after all the trouble people have gone to to describe every last field, component and sub-component, every system still has to be tweaked to get it to work.

      I think that as long as you see JSON as just a convenient way to transmit POD and you validate accordingly, the risks are minimised. I bet the attack surface for a JSON parser is a lot smaller than that of an HL7 parser.

      Oh and I've been coding for over 30 years now. If someone wants to refer to me as a 'hot young radical' the only word I object to is 'radical' ;)

  2. sabroni Silver badge

    Where were all the browsers in that list of tested software? Can we assume from their omission that they all parse JSON consistently?

    1. Spudley

      > Where were all the browsers in that list of tested software?

      I was going to ask that. I notice that there's a single column for "JavaScript", but it's just the one, and it doesn't specify which browser/JS engine he used. It's a pretty good bet that they all have their own bugs and quirks, particularly when you're going down to the kind of level of detail that he's testing at. Heck, he probably should have tested multiple individual versions of each of them as well, like he did for Swift and PHP.

      Talking of PHP, it's interesting that it seems to have come out with the best results here.

      1. TRT Silver badge

        True. The latest PHP parser is, by the looks of it, the "best". Though I note it has the most "expected result" entries, but if the specification is so lax, then I guess the expectation is HIS expectation. I'm willing to lay money on him being a dyed-in-the-wool, brought-up-on-PHP coder.

        1. Anonymous Coward
          Anonymous Coward

          The PHP JSON parser is crap too. I've caught it red-handed.

  3. gnasher729 Silver badge

    I'm a bit confused here.

    There is nothing in the JSON spec that wouldn't allow 500 nested arrays, so claiming that a parser fails because it parses such an array is nonsense.

    BOMs at the beginning of the JSON document are absolutely to be expected.

    What they call "illegal Unicode characters" like U+FFFE are actually perfectly legal Unicode - not in the very first Unicode standard, but making them illegal caused so much trouble that they are nowadays legal Unicode characters.

    1. Anonymous Coward
      Anonymous Coward

      Re: I'm a bit confused here.

      Also (my emphasis):

      Parsers also have to handle raw bytes that don't encode Unicode characters. For instance, the byte FF does not represent a Unicode character in UTF-8. As a consequence, a string containing FF is not a <u>UTF-8 string</u>. In this case, parsers should simply refuse to parse the string, because "<u>A string</u> is a sequence of <u>zero or more Unicode characters</u>" (RFC 7159 section 1) and "JSON text SHALL be encoded in Unicode" (RFC 7159 section 8.1).

      I think this is open to interpretation. I would have considered a string containing raw bytes not encoding Unicode characters to be "a sequence of zero [...] Unicode characters" and therefore pass the definition of "A string" in the referenced spec. Note the difference between "a string" and "a UTF-8 string".
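
      Either way, in Python 3 terms the decode failure can be surfaced explicitly rather than left to whichever parser happens to be on the other end - a minimal sketch:

          import json

          raw = b'"\xff"'   # 0xFF can never appear in valid UTF-8

          try:
              text = raw.decode("utf-8")   # fails before the JSON parser is involved
          except UnicodeDecodeError as exc:
              print("not a UTF-8 string:", exc)
          else:
              print(json.loads(text))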

  4. tiggity Silver badge

    JS drives JSON use

    I prefer other more rigorous formats to JSON (e.g. XML with published schema to validate against)

    However, JSON tends to be the popular choice in lots of projects because (assuming, e.g., REST API endpoint communication with a "backend") the endpoint can return JSON-format data and client-side JavaScript can easily handle it and do something with it (and because it's easy to package data as JSON from JavaScript, it also tends to be popular to have the client send JSON data to the endpoints, so backend code needs to unwrap JSON).

    JSON's popularity is (IMHO) mainly due to the desire to do as much as possible via client-side JavaScript, as it allows just part of a page to be updated rather than a whole-page round-trip refresh (albeit often with unwanted side effects for the user, where e.g. it's typically impossible to bookmark the content you want on a page that has a lot of JavaScript updates).

    But hipsters love it (even though it's a total PITA to test sites that are heavily dependent on client-side JavaScript, compared to having more content rendering done on the backend, and you have to deal with cross-browser JavaScript quirks or use bloaty frameworks such as jQuery), as you can get something up & running quickly that looks nice, which is what lots of clients want, as too often the decision makers are more bothered about shiny than scalability, security etc.

    1. Destroy All Monsters Silver badge
      Thumb Up

      Re: JS drives JSON use

      Upvotery occurs

      XML with all its warts for serious stuff

      JSON for when quick webcrud is required because the customer (or the boss) asked for something to be done quickly (or with no money forthcoming), and where foisting technical debt on the next intern is standard operating procedure.

      (But why is there an image of the Golden Ant leading this article?)

      1. Andrew Commons

        Re: The Golden Ant

        Mythology maybe...Jason and the Golden...

        1. TRT Silver badge

          Re: The Golden Ant

          JSON & the R Go 0s.

      2. Anonymous Coward
        Holmes

        Re: JS drives JSON use

        Yeah... JSON is a disappointment. Web developers jumped on it because it was (and still is) the only widely-supported alternative to XML with its serious complexity/consistency/security problems. Among my circles we started using JSON in production about 3 years ago. Once exposed to the real world, the bugs began to bite almost immediately. It didn't work out. I still use JSON for AJAX stuff but not much else.

        I think we've only seen the tip of the JSON iceberg. Buggy and inconsistent implementations, error-prone quoting & escaping, the frequent need to embed JSON in other error-prone web markup formats (and vice versa), and the general sloppiness of web code.... it's a meltdown just waiting to happen... though probably not before IoT DDoSageddon.

        String quoting/markup is the fundamental flaw here. The Unix philosophy (text only) led us down the wrong track. Binary data serialization is much simpler.

    2. Matt Bryant Silver badge
      Alert

      Re: tiggity Re: JS drives JSON use

      ".....hipsters....as you can get something up & running quick & looks nice which is what lots of clients want as too often the decision makers are more bothered about shiny than scalability, security etc." COUGH * agile development * COUGH.

    3. Anonymous Coward
      Anonymous Coward

      Re: JS drives JSON use

      > (e.g. XML with published schema to validate against)

      Care to show an example of your code? I smell bullshit here, but it may just be incompetence instead.

      As I have posted above, I have experience with safety of life systems. At one point there was in the industry the bright idea that certain information should be exchanged using a "strictly specified", schema-validated, XML format. This introduced huge extra complexity and lots more potential points of failure. Most of the problems in the end came from the specification itself, which turned out to be inadequate for the task, causing some implementers to make dangerous decisions in an effort to comply with the specs, knowing full well that those were not fit for purpose.

      That sort of stuff can work in very local environments. When you have larger, more complex systems, you need to handle things at a more strategic level with general principles rather than detailed specs, and then evolve iteratively as the product matures.

  5. MatsSvensson

    Easy!

    Just use excel-flies for everything.

    I hear Microsoft's specification is about the size of the Bible.

    Nice and standardized.

    1. Hero Protagonist

      Re: Easy!

      "excel-flies"

      Are those the insects that buzz around a steaming pile of Excel?

  6. jillesvangurp

    I actually double-checked whether there was any reason for me to get concerned. Turns out there isn't, of course. I'm using Jackson, like most Java shops would. It turns out all the supposed problems boil down to issues handling UTF-16. Simple suggestion: don't do UTF-16, and if you do, do it properly. The spec says JSON is UTF-8, no matter what individual vendors (cough MS cough) pretend. If you do insist on UTF-16, just configure the damn Reader and Writer correctly and don't rely on default encodings, ever (there is no such thing). The rest of the issues are essentially variants of parsers failing to NOT parse things like comments, instead of failing. Arguably this is a feature, not a bug.

    So basically, bog-standard JSON without any encoding weirdness or comments will parse. Every time. If not, file a bug.

    So this story is basically that somebody found some interesting edge cases in several parsers. Somehow this snowballed into this bullshit. Some of these issues might legitimately be filed as bugs (e.g. the bash parser crashing). But most of this will likely fall into the 'meh, wontfix' category rather than the 'OMG INTERNET IS BROKEN' category. The world is in fact not ending and there is zero reason to switch data formats, upgrade parsers, or even read this article, over this, for the vast majority of the supposed user base (world + dog).
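
    (The "configure the Reader and Writer, don't rely on default encodings" advice is language-neutral; in Python terms it's a one-liner each way - file name invented for the sketch:)

        import json

        data = {"name": "Zoë"}

        # Write: name the encoding explicitly instead of trusting the platform default.
        with open("data.json", "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)

        # Read: state it again; don't let the runtime guess.
        with open("data.json", encoding="utf-8") as f:
            assert json.load(f) == data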

    1. Androgynous Cupboard Silver badge

      If you read his article you'll notice he refers to RFC7159, which states "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.".

      No, this isn't in the original "spec", so if you were working from that it wouldn't be a hard fail. But it is in one of the specs that claim to define JSON so is a reasonable thing to test.

      Snowballed into this bullshit? I'll be generous and assume you are unfamiliar with the process of "testing your code", but working from a collection of edge cases is pretty much the definition of testing when you come to implement a specification. I have worked from plenty of specifications without them and they are all, without exception, bad specifications. Words are always ambiguous; a test case that passes or fails is not.

      "most Java shops use Jackson", oh I don't think so. We were so dissatisified with that, and the various other half-baked or over-baked options that we wrote our own, which is now passing all but a few outliers thanks to the efforts of Mr. Seriot, to whom I am much obliged for his efforts.

    2. Boothy

      You seem to be stating that Jackson is failing simply because of the use of UTF-16, and suggesting the use of UTF-8 to 'fix' the issue!

      Try reading the actual specs (or the referenced article).

      Quote: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8".

      If Jackson is failing due to the encoding alone, then the failure is with Jackson, not the encoding.

  7. Anonymous Coward
    Anonymous Coward

    Ok, so I actually read Seriot's blog

    Which is a nice bit of research. His findings can be summarised in three categories:

    * Something is defined in a specification but one or more implementations do not conform.

    * Something is not defined in a specification, and implementations differ in their course of action.

    * Different specs conflict with each other.

    So? All of this, to the extent that it is relevant, practical, economic, and within scope, is controlled for at the project requirements level. For some projects it'll be a big deal, for others it'll be irrelevant, for most it'll fall somewhere in between.

    Note that we could be talking about any technology whatsoever here. General sound engineering principles still apply.

    1. Tom 7

      Re: Ok, so I actually read Seriot's blog

      So some implementations need updating to conform to the well-defined bits, and the specification needs to be updated to define the non-defined bits people are working with.

      Or we could just have a meltdown and run around waving our hands about screaming the world is going to end.

    2. Anonymous Coward
      Anonymous Coward

      Re: Ok, so I actually read Seriot's blog

      So far he's only tested for obvious weaknesses in many different languages/implementations, not for all possible weaknesses. He listed several possibilities at the end.

      But what he's really trying to say is,

      As a final word, I keep on wondering why "fragile" formats such as HTML, CSS and JSON, or "dangerous" languages such as PHP or JavaScript, became so immensely popular. This is probably because they are easy to start with by tweaking contents in a text editor, because of too-liberal parsers or interpreters, and seemingly simple specifications. But sometimes, simple specifications just mean hidden complexity.

      1. Anonymous Coward
        Anonymous Coward

        Re: Ok, so I actually read Seriot's blog

        > But what he's really trying to say is,

        As a dialectic point, that's not what he's "trying to say". That, a verbatim quote of his concluding remarks, is what he has actually said.

        And btw, he's perfectly welcome to go consistently choosing whichever openly complex, overspecified technologies he fancies if he thinks that's the best way to deliver on spec, on budget, and on schedule.

        1. Anonymous Coward
          Anonymous Coward

          Re: Ok, so I actually read Seriot's blog

          Well, he could've just ranted that everything is crap, but he had to tear apart JSON and Unicode in order to be taken seriously. Yep, he's a Unicode hater too. This guy rocks.

          > And btw, he's perfectly welcome to go consistently choosing whichever openly complex, overspecified technologies he fancies

          But instead he chose JSON, which is none of those things. *cough cough*

  8. thames

    Python Results

    I checked the results for Python 3.5 (which is what I use), and I don't see much of a practical problem. The issues seem to mainly come down to the parser not rejecting invalid unicode, and the handling of extreme floating point numbers.

    The author puts the test results into the following categories (counts shown for Python 3.5):

    * expected result
    * parsing should have succeeded but failed: 2
    * parsing should have failed but succeeded: 3
    * result undefined, parsing succeeded: 15
    * result undefined, parsing failed: 5
    * parser crashed: 0
    * timeout: 0

    The 2 "parsing should have succeeded but failed" were two versions (big endian and little endian) of obscure utf16 strings which the author feels should have been interpreted as empty strings but which the parser rejected (at least that's what it did when I tested it). Nearly every other parser "failed" these tests for the same reason. When nearly everyone else does things one way even though the spec implies something different, you're probably better off going with the crowd. I can't say I can really argue with how Python handles it.

    The 15 "result undefined, parsing succeeded" are also mainly obscure unicode conversions. You can say that JSON "shall be unicode" until your blue in the face, but if someone sends you Windows 1252 or ISO-Latin-1, having the parser simply reject it is going to cause you nothing but grief in the real world. You're better off getting the "invalid" unicode into your program and handling it according to whatever is appropriate for your situation (if in fact is actually is a problem for you). The best place to handle "bad data" (assuming the data is even a problem for your application) may be elsewhere in your program rather than in the JSON parser. The article itself admits that this is not necessarily wrong. Like in the above case, the majority of other parsers handle these in the same way.

    He also very arbitrarily "fails" Python in this category because it didn't reject a deeply nested array (500 deep), despite the spec saying that there is no limit. Any limit is implementation dependent. Since any computer will have hardware limits to how much data it can handle, that limit is inherently arbitrary. The author is clearly wrong in this case.
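
    (For what it's worth, a quick check with the standard-library parser shows where the real limit lies - it's the interpreter's recursion limit, not anything in the format:)

        import json

        # 500 levels of nesting, as in the test case: parses without complaint.
        json.loads("[" * 500 + "]" * 500)

        # Push far past the interpreter's recursion limit and the parser refuses,
        # cleanly, with an exception rather than a crash.
        try:
            json.loads("[" * 100000 + "]" * 100000)
        except RecursionError:
            print("too deep for this implementation")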

    The 3 "parsing should have failed but succeeded" were related to the parser handling NaN, -infinity, and infinity instead of rejecting them. Whether or not you want the JSON parser to accept those is up to the programmer as this is a parameter in the function call whether to raise an exception when it encounters them. If you don't want to be able to handle them, then disable it. The author admits that he "failed" python on these three tests simply because he feels that most people would set it to accept NaN, -infinity, and infinity. I have to give the author's decision on this a big WTF?

    The 5 "result undefined, parsing failed" again seemed to be similar to the 2 "parsing should have succeeded but failed" cases. That is the parser rejected the data instead of silently returning empty data structures. Every other JSON parser also "failed" on this one. Again, from a practical standpoint I can't argue with the way that Python handles it.

    Python 3.5 did not crash or have time-outs on any of the tests. This is a very big plus in my book.

    The problem with articles that purport to test some feature with every common language is that the author usually doesn't understand all the languages themselves in any great depth, and he often accepts the design choices of his favourite language as being the "right" way to handle things.

    1. Anonymous Coward
      Anonymous Coward

      Re: Python Results

      Nice post, thames.

      I completely agree that, while these tests are always interesting, his conclusions are rather arbitrary.

      1. Anonymous Coward
        Anonymous Coward

        Re: Python Results

        Nonetheless, they show the need for a single, complete specification.
