Discussion:
TJSON: Tagged JSON with Rich Types
Tony Arcieri
2016-10-25 22:15:49 UTC
I wanted to give LANGSEC a sneak peek of a project I've been working on
with Ben Laurie before circulating it more widely:

https://www.tjson.org/

It's a set of security-oriented type annotations added to JSON. The idea is
to support cross-format content hashes which are the same regardless of if
data is serialized in a binary format like Protobufs, MessagePack, or BSON,
or in TJSON. The intended content hash algorithm is Ben Laurie's objecthash:

https://github.com/benlaurie/objecthash

We have also disallowed some of the more notable sharp edges for JSON
security, such as repeated member names in JSON objects. If there are any
other notable problems you think should be addressed, I'd be curious to
hear them.
--
Tony Arcieri
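
A minimal sketch of how a parser can refuse repeated member names, assuming Python's standard json module as the underlying parser. The names reject_duplicates and strict_loads are hypothetical and are not taken from TJSON or any TJSON implementation.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated member names."""
    seen = {}
    for name, value in pairs:
        if name in seen:
            raise ValueError(f"duplicate member name: {name!r}")
        seen[name] = value
    return seen


def strict_loads(text):
    """Parse JSON text, raising ValueError on any repeated member name."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


# The stock parser silently keeps the last value for a repeated name.
lenient = json.loads('{"a": 1, "a": 2}')

# The strict variant rejects the same input.
try:
    strict_loads('{"a": 1, "a": 2}')
    rejected = False
except ValueError:
    rejected = True
```

The hook runs once per object, including nested ones, so the check applies at every depth.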
Jeffrey Goldberg
2016-10-26 06:16:13 UTC
Post by Tony Arcieri
https://www.tjson.org/
Thank you! I have wanted something like this to exist.
Post by Tony Arcieri
If there are any other notable problems you think should be addressed, I'd be curious to hear them.
If the UTF8 strings aren't normalized, you will get different hashes for visually and semantically identical strings.

Cheers,

-j
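
A small illustration of the normalization concern, as a sketch only. Plain SHA-256 over the UTF-8 bytes stands in for a content hash here (it is not objecthash), and the choice of NFC as the normalization form is an assumption. The digest helper is a hypothetical name.

```python
import hashlib
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent


def digest(s, normalize=False):
    """SHA-256 of the UTF-8 encoding, optionally NFC-normalized first."""
    if normalize:
        s = unicodedata.normalize("NFC", s)
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


# Without normalization the two spellings hash differently.
raw_match = digest(composed) == digest(decomposed)

# With normalization they hash the same.
normalized_match = digest(composed, True) == digest(decomposed, True)
```

Both strings render identically, but only the normalized digests agree.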
Sven M. Hallberg
2016-10-26 10:19:54 UTC
Post by Tony Arcieri
https://www.tjson.org/
Neat!

You describe the generic form of "<tag>:..." in BNF, but you can also
describe all your higher-level requirements in the grammar. Are there
plans to produce a fully grammatical specification?

You make a point of the language being a subset of JSON which "can be
understood by existing JSON parsers". A grammar for the subset is needed
to perform proper recognition before processing by a generic JSON
parser.
Post by Tony Arcieri
If the UTF8 strings aren't normalized, you will get different hashes
for visually and semantically identical strings.
Along the same line, beware of surrogate pairs escape-encoded in the
string. E.g.:

"s:\uXXXX\uXXXX"

Here is the relevant piece of ABNF I once wrote for my JSON-like pet
project^1:

esc-unicode = u (u-basic / u-surro)

u-surro     = u-surro-hi backslash u u-surro-lo
u-basic     = (r0C / rEF) hexdig hexdig hexdig   ; not D...
            / dD r07 hexdig hexdig               ; D[0-7]..
u-surro-hi  = dD r8B hexdig hexdig               ; D[8-B]..
u-surro-lo  = dD rCF hexdig hexdig               ; D[C-F]..

; hex ranges
r0C = %x30-39 / %x41-43 / %x61-63 ; 0-9 A B C
rEF = %x45-46 / %x65-66 ; E F
r07 = %x30-37 ; 0-7
r8B = %x38-39 / %x41-42 / %x61-62 ; 8 9 A B
rCF = %x43-46 / %x63-66 ; C D E F

u = %x75 ; u
dD = %x44 / %x64 ; d D

^1: http://khjk.org/log/2012/jun/datalang.html


-pesco
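
The same pairing rule can be sketched as a scanner over the raw text of a JSON string (the characters between the quotes, escapes still intact). This is an illustration under assumptions, not a port of the ABNF: surrogates_well_formed is a hypothetical name, and the function checks only surrogate pairing, not whether other escapes are valid.

```python
import re

# A backslash followed by either uXXXX (captured) or any single character.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def surrogates_well_formed(raw):
    """True if every escaped high surrogate is immediately followed by an
    escaped low surrogate, and no low surrogate appears on its own."""
    pending_high = False
    pos = 0
    while pos < len(raw):
        if raw[pos] != "\\":
            if pending_high:
                return False  # high surrogate followed by a plain character
            pos += 1
            continue
        match = _ESCAPE.match(raw, pos)
        if match is None:
            return False  # dangling backslash at end of input
        hex4 = match.group(1)
        code = int(hex4, 16) if hex4 else None
        is_high = code is not None and 0xD800 <= code <= 0xDBFF
        is_low = code is not None and 0xDC00 <= code <= 0xDFFF
        # A low surrogate is required exactly when a high one is pending.
        if pending_high != is_low:
            return False
        pending_high = is_high
        pos = match.end()
    return not pending_high
```

For example, surrogates_well_formed(r"\uD83D\uDE00") accepts a proper pair, while a lone r"\uD83D" or r"\uDE00" is rejected.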
Tony Arcieri
2016-10-26 16:34:33 UTC
Post by Jeffrey Goldberg
If the UTF8 strings aren't normalized, you will get different hashes for
visually and semantically identical strings.
Unicode normalization is presently an optional flag in objecthash, but
should be on by default (I think?) and supported by all implementations.
--
Tony Arcieri
Jeffrey Goldberg
2016-10-26 18:48:48 UTC
Post by Tony Arcieri
Post by Jeffrey Goldberg
If the UTF8 strings aren't normalized, you will get different hashes for visually and semantically identical strings.
Unicode normalization is presently an optional flag in objecthash,
I noticed that only after sending my message.
Post by Tony Arcieri
but should be on by default (I think?) and supported by all implementations.
Yep.

Again, thanks for getting this started. This has been something I’ve been concerned about for a while, but not sufficiently concerned about to actually act on.

Cheers,

-j
Tony Arcieri
2016-10-26 17:09:17 UTC
Some serendipitous timing here, I just saw this article "Parsing JSON is a
Minefield":

http://seriot.ch/parsing_json.html

It shows how dramatically differently various parsers handle various types
of malformed JSON. One of the things I've been trying to put together in
TJSON is a comprehensive set of test cases that ensure conforming parsers
have the same behavior:

https://github.com/tjson/tjson-spec/blob/master/draft-tjson-examples.txt
--
Tony Arcieri
Tony Arcieri
2016-11-02 17:50:47 UTC
I just published a blog post about TJSON here:

https://tonyarcieri.com/introducing-tjson-a-stricter-typed-form-of-json
--
Tony Arcieri
Tony Arcieri
2016-11-03 03:20:26 UTC
Based on feedback and various discussions I've had, I've come up with an
interesting proposal:

https://github.com/tjson/tjson-spec/issues/30

Instead of embedding type information on a string-by-string basis:

{"s:foo": "s:bar"}

We can use objects as the one source-of-truth for all type information,
mandating an object as the only allowable root expression in the grammar,
and embedding a type signature as a postfix tag in the name/key of each
object member:

{"foo:s": "bar"}

In this scheme, objects act as a self-describing record-like product type.
They describe the rest of the type information in what approaches a
somewhat familiar type signature syntax.

Previously the TJSON spec allowed arrays as a root expression as well,
however this syntax prevents arrays from

This means the only type allowed for member names is a string (which seems
fine to me). It slightly complicates some other TJSON use cases, such as
redaction. But otherwise I think it's an improvement.

Moving the type signature to a postfix tag like this means we can add
additional interpretations of arrays in a non-hacky way. Sets were proposed
as another non-scalar type, but incorporating them before was a bit
hacky. Now we can define a syntax for non-scalar types, and to disambiguate
them let's use upper case type names:

- A: array
- S: set

Now we can specify a set of strings using something like a C++/Java-ish syntax:

{"words:S<s>": ["foo", "bar", "baz"]}

Or a multi-dimensional array of integers as:

{"dialpad:A<A<i>>": [["1","2","3"], ["4","5","6"], ["7","8","9"]]}

Perhaps an array of objects?

{"myobjects:A<O>": [{"foo:i":"1"},{"bar:i":"2"},{"baz:i":"3"}]}

I think this is an improvement, and also enforces homogeneous types, which
would be more amenable to mapping to static type systems.
--
Tony Arcieri
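
A rough sketch of how the postfix tags in the examples above could drive decoding. It covers only the tags that appear in this message (s, i, A<...>, S<...>, and a bare O), and it assumes integers are carried as decimal strings, as in the examples. The function names are hypothetical, and this is not the TJSON reference behaviour.

```python
import re

_INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")


def decode(tag, value):
    """Check value against tag and convert it to a native Python value."""
    if tag == "s":
        if not isinstance(value, str):
            raise TypeError(f"expected string, got {value!r}")
        return value
    if tag == "i":
        if not isinstance(value, str) or not _INTEGER.fullmatch(value):
            raise TypeError(f"expected integer string, got {value!r}")
        return int(value)
    if tag == "O":
        if not isinstance(value, dict):
            raise TypeError(f"expected object, got {value!r}")
        return decode_object(value)
    if tag[:2] in ("A<", "S<") and tag.endswith(">"):
        if not isinstance(value, list):
            raise TypeError(f"expected array, got {value!r}")
        # Every element must satisfy the same inner tag (homogeneous types).
        items = [decode(tag[2:-1], item) for item in value]
        return items if tag[0] == "A" else set(items)
    raise ValueError(f"unknown tag: {tag!r}")


def decode_object(obj):
    """Decode an object whose member names end in ':<tag>'."""
    result = {}
    for name, value in obj.items():
        base, sep, tag = name.rpartition(":")
        if not sep:
            raise ValueError(f"untagged member name: {name!r}")
        result[base] = decode(tag, value)
    return result


dialpad = decode_object(
    {"dialpad:A<A<i>>": [["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"]]}
)
```

A document such as {"n:A<i>": ["1", 2]} fails with TypeError, which is the "doesn't typecheck" case: it is valid JSON, but the value does not match the signature in the member name.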
Sven M. Hallberg
2016-11-03 09:46:04 UTC
Post by Tony Arcieri
{"foo:s": "bar"}
Suddenly your grammar for the value depends on a piece of information
inside the key...
Post by Tony Arcieri
This means the only type allowed for member names is a string (which seems
fine to me).
This one I would actually suggest for consideration in the original
form. {"s:x": "s:foo", "s:y": "s:bar", "s:z": "s:baz"} just seems kind
of silly.
Post by Tony Arcieri
{"dialpad:A<A<i>>": [["1","2","3"], ["4","5","6"], ["7","8","9"]]}
Now this looks definitely context-sensitive. One nested structure on the
right of the ':' depending on another to the left. You can no longer get
away with a grammar but you'll have all the fun of a type system.

Also I'm sure some will want their heterogeneous lists back.
Post by Tony Arcieri
{"myobjects:A<O>": [{"foo:i":"1"},{"bar:i":"2"},{"baz:i":"3"}]}
Wait a minute, why are you stopping at objects with the type
refinement? Shouldn't you put your entire schema into the type?

Obviously, that's half ironic/rhetorical, but it seems clear that this
scheme is a complication of the original, so not a Pareto-efficient
improvement.


-pesco
Jeffrey Goldberg
2016-11-04 01:50:44 UTC
Post by Sven M. Hallberg
Post by Tony Arcieri
{"dialpad:A<A<i>>": [["1","2","3"], ["4","5","6"], ["7","8","9"]]}
Now this looks definitely context-sensitive.
There is a lot of stuff that is between context-free and context-sensitive.
Post by Sven M. Hallberg
One nested structure on the
right of the ':' depending on another to the left. You can no longer get
away with a grammar but you'll have all the fun of a type system.
I think that this can be addressed with what I think we used to call
“index grammars”. When it was convincingly shown that natural languages
contained constructions that could not be handled by CFGs, only a relatively
small increase in the power of the grammar was needed to handle that.

This is stuff that I studied in the mid 80s, and will have to look it up, but that
construction looks like it has the same sorts of formal properties as reduplication
in Bambara or the cross-serial dependencies in Swiss German.

I will try to look this stuff up later this evening or sometime tomorrow (Friday).

Cheers,

-j
Tony Arcieri
2016-11-04 02:08:37 UTC
Post by Sven M. Hallberg
Post by Tony Arcieri
{"dialpad:A<A<i>>": [["1","2","3"], ["4","5","6"], ["7","8","9"]]}
Now this looks definitely context-sensitive. One nested structure on the
right of the ':' depending on another to the left. You can no longer get
away with a grammar but you'll have all the fun of a type system.
The grammar is certainly still context-free:

<member>             ::= <tagged-string> <name-separator> <value>
<tagged-string>      ::= '"' *<char> ':' <tag> '"'
<tag>                ::= <non-scalar-expr> | <scalar-tag>
<non-scalar-expr>    ::= <non-scalar-tag> '<' <tag> '>'
<non-scalar-tag>     ::= <alpha-upper> *<alphanumeric-lower>
<scalar-tag>         ::= <alpha-lower> *<alphanumeric-lower>
<alphanumeric-lower> ::= <alpha-lower> | <digit>
<alpha-upper>        ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' |
                         'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' |
                         'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' |
                         'Y' | 'Z'
<alpha-lower>        ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' |
                         'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' |
                         'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' |
                         'y' | 'z'
<digit>              ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<alphanumeric-lower> ::= <alpha> | <digit>

But, as you noted, this does add a sort of type system to the language,
such that it's now possible to express documents which don't typecheck.

I agree this makes the format more complicated, but it does make the format
more amenable to mapping to statically typed programming languages. Also,
it's a rather simple type system, and one that can typecheck things in the
same pass as processing them (I believe; I have yet to implement it).

Post by Sven M. Hallberg
Wait a minute, why are you stopping at objects with the type
refinement? Shouldn't you put your entire schema into the type?
Objects act as self-describing product types, so no further type information is
necessary.
--
Tony Arcieri
Tony Arcieri
2016-11-04 02:09:24 UTC
Err whoops, ignore that extra <alphanumeric-lower> ::= <alpha> | <digit> at
the bottom ;)
--
Tony Arcieri
Sven M. Hallberg
2016-11-04 12:38:35 UTC
Post by Tony Arcieri
Post by Sven M. Hallberg
Post by Tony Arcieri
{"dialpad:A<A<i>>": [["1","2","3"], ["4","5","6"], ["7","8","9"]]}
Now this looks definitely context-sensitive. One nested structure on the
right of the ':' depending on another to the left. You can no longer get
away with a grammar but you'll have all the fun of a type system.
Indeed, the grammar for a conveniently chosen context-free superset is
still context-free. The actual language is not.
Post by Tony Arcieri
But, as you noted, this does add a sort of type system to the language,
such that it's now possible to express documents which don't
typecheck.
But more: You'll find you can't parse without "typechecking" anymore.
Surely you agree that parsing an integer is not the same as parsing
base64. Maybe there is an elegant grammatical formalism to describe your
language, but that doesn't change my point: It's a significant
complication over before.

For one thing, it's harder to write a parser now that reuses an existing
JSON framework. Before you could do this:

1) Full recognition by an automaton autogenerated from a CFG.
(Also yay decidable equivalence.)
2) Interpretation by existing JSON parser.
3) Simple visitor pattern on result to convert tagged strings to their
native representations.
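
Steps 2 and 3 of that pipeline might look roughly like the following for the original prefix-tagged scheme. This is a sketch under assumptions: step 1 (recognition against a grammar) is omitted, only an "s:" tag and an "i:" tag for decimal integers are handled, and the "i:" spelling is assumed here rather than taken from the spec. The convert function is a hypothetical name.

```python
import json


def convert(node):
    """Walk a parsed JSON tree and turn tagged strings into native values."""
    if isinstance(node, dict):
        return {convert(name): convert(value) for name, value in node.items()}
    if isinstance(node, list):
        return [convert(item) for item in node]
    if isinstance(node, str):
        tag, sep, body = node.partition(":")
        if not sep:
            raise ValueError(f"untagged string: {node!r}")
        if tag == "s":
            return body
        if tag == "i":
            return int(body)
        raise ValueError(f"unknown tag: {tag!r}")
    return node


# Step 2: an existing JSON parser does the heavy lifting.
tree = json.loads('{"s:foo": "s:bar", "s:count": "i:42"}')

# Step 3: a simple visitor converts each tagged string on its own.
native = convert(tree)
```

The point of the design is that each string carries its own tag, so the visitor never needs information from anywhere else in the tree.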
Post by Tony Arcieri
Objects as self-describing product types, so no further type information is
necessary.
...for parsing. I think your new proposal is less elegant than the
original because it confuses syntax and semantics. Above, you treat
things like semantics that should be syntax (valid forms of values,
depending on tag) and here you leave things that could be useful at a
semantic level (expected object structure) out because they are
unnecessary for syntax.

To be clear, I wouldn't advocate putting an entire schema into the tag.
I would advocate not putting half of one in it either.


Of course I just offer my thoughts in the hope that they are of some
insight.

-pesco
Tony Arcieri
2016-11-04 23:15:37 UTC
Post by Sven M. Hallberg
For one thing, it's harder to write a parser now that reuses an existing
JSON framework. Before you could do this:
1) Full recognition by an automaton autogenerated from a CFG.
(Also yay decidable equivalence.)
2) Interpretation by existing JSON parser.
3) Simple visitor pattern on result to convert tagged strings to their
native representations.
My understanding is parsers like Hammer can still handle these cases in one
pass (I think?). Would love to know!

Some quick BNF describing <member> and <tagged-string> according to:
https://tjson.org/spec/#rfc.section.2.1

<member> ::= <tagged-string> <name-separator> <value>
<tagged-string> ::= '"' *<char> ':' <tag> '"'

Unfortunately I don't have a well-defined grammar for <value>, as my
current definitions are somewhat entangled with the ABNF definition of JSON
in RFC 7159. I should definitely produce a full grammar! But you can
imagine it as being a sort of toplevel symbol.

Parsing and typechecking TJSON in one pass would involve obtaining the
parse tree for the LHS of a particular nonterminal and passing it to
the pushdown automaton parsing the RHS as a sort of parametric argument,
along with the remaining unconsumed tokens.

At each frame of the stack, the pushdown automaton continues its way
towards the terminals, but you unwrap a bit of the parse tree parameter and
pass it along with the next piece of input the automaton is consuming, so
long as the type signature is for a non-scalar value.

When the pushdown automaton has reached the terminals and has almost
finished extracting a node on the parse tree, before we return the parsed
node we call a small guard/validation function which takes two nodes of the
parse tree as arguments, where one is the type signature for the current
node, and the other is the parsed value.

A tl;dr version:

- For a particular nonterminal, I want to have a "parameterized" pushdown
automaton that uses LHS to assist parsing RHS, by passing the parse result
for LHS to the parser for RHS
- I want to add what are effectively "postconditions" to that pushdown
automaton which use something approaching boolean algebra to ensure the
result is valid

This sounds context-sensitive to me, I guess. But even if it is, all it's
doing is using type information on LHS to enrich the parsing of RHS.
Certainly there's ample precedent for doing that sort of thing in the
innumerable statically typed languages out there? If it's
context-sensitive, it seems like a very boring kind of context sensitivity.
But IANAL (I Am Not A Linguist)

These are exactly the kind of cases I think parser combinator libraries are
made for.

If not, making a second pass to typecheck the parse tree doesn't seem so
bad either.

There's a completely different approach I'll be using in the Ruby
implementation. It's a bit wacky, but I think it works out.
--
Tony Arcieri
Sven M. Hallberg
2016-11-07 10:07:51 UTC
Post by Tony Arcieri
To parse and typecheck TJSON in one pass, it would involve obtaining the
parse tree for the LHS of parsing a particular nonterminal and pass it to
the pushdown automaton parsing the RHS as a sort of parametric argument
along with the remaining unconsumed tokens.
This sounds like monadic bind (>>=) which we have in Hammer (I
implemented it), but it is for obvious reasons the single most general
combinator.
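
To make the shape of bind concrete, here is a toy sketch in Python. It is not Hammer's API: the parser representation (a function from text and position to a value and new position, or None on failure) and the names regex, fmap, and bind are all assumptions made for illustration. The input format "i=42" / "s=abc" is likewise a toy, not TJSON.

```python
import re


def regex(pattern):
    """Parser that matches a regular expression at the current position."""
    compiled = re.compile(pattern)

    def parse(text, pos):
        match = compiled.match(text, pos)
        return (match.group(0), match.end()) if match else None

    return parse


def fmap(parser, fn):
    """Apply fn to the value produced by parser."""
    def parse(text, pos):
        result = parser(text, pos)
        return None if result is None else (fn(result[0]), result[1])

    return parse


def bind(parser, make_next):
    """Run parser, then choose the next parser from the value it produced."""
    def parse(text, pos):
        first = parser(text, pos)
        if first is None:
            return None
        value, pos = first
        return make_next(value)(text, pos)

    return parse


# The parser for the right-hand side is selected by the tag on the left.
VALUE_PARSERS = {
    "i=": fmap(regex(r"-?[0-9]+"), int),
    "s=": regex(r"[a-z]*"),
}
tagged = bind(regex(r"[is]="), lambda tag: VALUE_PARSERS[tag])
```

Because make_next can return any parser at all, the language recognised by a bind is not constrained by a grammar class, which is why it is the most general combinator.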

Tony Arcieri
2016-11-05 02:31:19 UTC
I've updated the TJSON web site with a description of the new format:

https://www.tjson.org/
--
Tony Arcieri