Authenticated Data Structures, as a Library, for Free!

Let's assume that you're querying to some database stored in the cloud (i.e., on someone else’s computer). Being of a sceptical mind, you worry whether or not the answers you get back are from the database you expect. Or is the cloud lying to you?

Authenticated Data Structures (ADSs) are a proposed solution to this problem. When the server sends back its answers, it also sends back a “proof” that the answer came from the database it claims. You, the client, verify this proof. If the proof doesn't verify, then you’ve got evidence that the server was lying. If the proof does verify, then there is a guarantee that the server’s response is legitimate (usually up to the possibility of a hash collision).

This all seems great, but doesn’t address the question of how anyone might build an ADS, and, crucially, prove that it has the security of unforgable (up to hash collisions) verification. A brute-force way to implement an ADS is for the client to retain a copy of the database, and to check the server’s results against the copy. But then what would be the point of the server?

Merkle Trees are the original example of an ADS. Merkle trees solve the problem of needing a complete copy of the database by having the client store a hash of the database. The server’s proof is then verified against the hash. But what if we want a data structure that isn’t trees? Will we have to implement our own ADS? How will we know that we have done it correctly? Implementing a completely new ADS has three problems. We need: to invent a way to do the authentication, a proof that authentication has been done correctly, and a proof that we have correctly implemented it.

Andrew Miller, Michael Hicks, Jonathan Katz, and Elaine Shi have a solution to this meta-problem in their POPL2014 paper Authenticated Data Structures, Generically, by describing a new language “Lambda-Auth” for implementing correct-by-construction ADSs.

Miller et al. describe a programming language with special constructs for writing ADSs. Once you’ve written an implementation of your data structure, you annotate the implementation with ‘authentication’ markers that indicate points in the structure that act as authentication checkpoints. On the client (verifier) side, each checkpoint becomes a hash code representing the real data, which will be checked against the proof sent by the server. On the server (prover) side, each checkpoint is an indicator of where to generate a piece of proof. The key insight of Miller et al. is that the client and server run the same code, just with two different interpretations of what the authentication checkpoints mean. They are then able to prove, for all programs, that the server and client sides will always agree, and that proofs of authentication are unforgable (up to hash collisions). The authors wrote a PL Enthusiast blog post with a gentle introduction, and a second part with more detail.

In this post, I'll show that in a language with sufficiently powerful abstraction facilities (in this case OCaml), it is possible to implement Miller et al.’s solution as a library within the language, with no need to create a new language, or to alter an existing language’s implementation.

Moreover, although I won’t give any details in this blog post, I claim that the correctness proof given by Miller et al. will be an instance of parametricity for higher-kinded types. I’ll give a few sketchy details at the end.

Authenticated Data Structures as a Library

To get started, I need to isolate a description of the bits and pieces we need to be able to write authenticated data structures, taking Miller et al.’s paper as a guide. OCaml provides module signatures as a handy way of collecting together requirements for the existence of certain types and functions. I'll call the types and functions we need for making ADSs an AUTHENTIKIT: a kit for implementing authenticated data structures:

module type AUTHENTIKIT = sig

The first thing is to postulate the existence of the type constructor auth that represents the type of authenticated values. This is the OCaml rendering of the “•” type constructor used in Miller et al.’s language for representing authenticated values. Note that I haven’t committed to any implementation of this type: it is left completely abstract in the AUTHENTIKIT interface:

  type 'a auth

We’ll need is a way of describing abstract authenticated computations: computations that are either generating proofs or verifying proofs. In Miller et al.’s setup, authenticated computations are built in to the language. However, OCaml comes with with its own fixed notion of computation, so we have to use a monad to layer our own notions on top:

  type 'a authenticated_computation
  val return : 'a -> 'a authenticated_computation
  val (>>=)  : 'a authenticated_computation ->
               ('a -> 'b authenticated_computation) ->
               'b authenticated_computation

The third thing I’ll need is a way of proving that the types of data that we want to authenticate are “authenticatable”. This will essentially mean that they are serialisable to a string representation in way that is suitable for proof construction and verification. The requirement to be serialisable is hidden in Miller et al.’s formalism because they assume that all data in the language is serialisable. This is a somewhat fishy assumption (how do you serialise and deserialise a function without breaking abstraction boundaries?), so here I’m being a bit more formal, and requiring the programmer to build evidence of serialisability using the following combinators tucked away in a submodule of any AUTHENTIKIT implementation:

  module Authenticatable : sig
    type 'a evidence
    val auth   : 'a auth evidence
    val pair   : 'a evidence -> 'b evidence -> ('a * 'b) evidence
    val sum    : 'a evidence -> 'b evidence ->
                      [`left of 'a | `right of 'b] evidence
    val string : string evidence
    val int    : int evidence
  end

Finally, I postulate the existence of the auth and unauth functions as Miller et al. do. The difference here is that I have explicitly requested evidence that the data type involved is authenticatable. Also, the unauth function returns a computation in our authenticated computation monad, indicating that this is where some of the work to get an authenticated data structure to work will happen.

  val auth   : 'a Authenticatable.evidence -> 'a -> 'a auth

  val unauth : 'a Authenticatable.evidence ->
               'a auth ->
               'a authenticated_computation
end

An Autheticated Data Structure: Merkle Trees

Before I describe the prover and verifier implementations of the AUTHENTIKIT module type, I’ll give an example of an authenticated data structure implementation. I’ll do a basic Merkle tree, with the following interface:

module type MERKLE =
  functor (A : AUTHENTIKIT) -> sig
    open A

    type path = [`L | `R] list
    type tree = [`left of string | `right of tree * tree] auth

    val make_leaf : string -> tree
    val make_branch : tree -> tree -> tree
    val retrieve :
	  path -> tree -> string option authenticated_computation
    val update :
	  path -> string -> tree -> tree option authenticated_computation
  end

The first thing to note about this module signature is that its implementations are parameterised by implementations of AUTHENTIKIT. This will be characteristic of any authenticated data structure implementation we implement in this style: we need the freedom to instantiate the implementation with alternative AUTHENTIKITs to get the prover and verifier sides of the system.

The second thing to note is that the data structure interface is a very lightly annotated version of a completely boring binary tree interface. We have types for trees and paths, and operations to construct trees, query trees, and update trees. The only differences from a normal interface is that the tree type is annotated with an extra auth type constructor, indicating how the tree is augmented with authentication information, and the authenticated_computation type constructors on the retrieve and update operations, indicating that they are authenticated operations.

The implementation is likewise a straightforward implementation of a binary tree in OCaml, with a couple of extra annotations for the authentication. The Merkle module signature states that implementations are parameterised by AUTHENTIKITs, so we declare the Merkle implementation like so:

module Merkle : MERKLE =
  functor (A : AUTHENTIKIT) -> struct
    open A

The open A introduces all the members of the AUTHENTIKIT into scope, so we don't have to qualify any names. The first thing to do is to fulfil the promise of definitions for the types path and tree that we gave in the interface:

    type path = [`L|`R] list
    type tree = [`left of string | `right of tree * tree ] auth

Next, we construct some evidence that the body of the tree type is authenticatable, so we will be able to use auth and unauth on trees.

    let tree :
     [`left of string | `right of tree * tree] Authenticatable.evidence =
      Authenticatable.(sum string (pair auth auth))

The make_leaf and make_branch functions build leaves and branches using the appropriate constructors.

    let make_leaf s =
      auth tree (`left s)

    let make_branch l r =
      auth tree (`right (l,r))

To query our authenticated data structure, we have two functions retrieve and update. First, the implementation of retrieve which takes a path and a tree and returns the data identified by that path, if it exists.

    let rec retrieve path t =
      unauth tree t >>= fun t ->
      match path, t with
        | [],       `left s      -> return (Some s)
        | `L::path, `right (l,r) -> retrieve path l
        | `R::path, `right (l,r) -> retrieve path r
        | _,        _            -> return None

As I mentioned above, this is really nothing more than a standard binary tree search implementation, obfuscated by the need to unwrap the authenticated input, and the additional >>= and returns needed to track the authenticated computation. The update implemention is similar in its similarlity to a standard binary tree update function:

    let rec update path v t =
      unauth tree t >>= fun t ->
      match path, t with
        | [],       `left _      ->
           return (Some (make_leaf v))
        | `L::path, `right (l,r) ->
           (update path v l >>= function
             | None    -> return None
             | Some l' -> return (Some (make_branch l' r)))
        | `R::path, `right (l,r) ->
           (update path v r >>= function
             | None    -> return None
             | Some r' -> return (Some (make_branch l r')))
      | _ ->
         return None

Finally, an end to complete the definition of Merkle:

end

Just to emphasise again, this implementation is essentially the same as a standard binary tree implementation written in OCaml. In the code, we didn’t have to mention anything to do with authenticated data structures, except for placing auth and unauth in the correct places. And even if we get in a muddle doing that, the OCaml type checker will helpfully inform us where we’ve made a mistake. One wrinkle is that I've had to give up writing in “direct style” and had to write in monadic style.

The Prover

The Merkle module is not yet ready for use. To use it we need an implementation of the AUTHENTIKIT interface. The different implementations of AUTHENTIKIT will correspond to the different semantics in Miller et al.’s presentation: the prover and the verifier.

The first implementation of AUTHENTIKIT I'll give is the prover.

Proofs

The prover constructs proofs, which I'll represent as lists of JSON values, using the Ezjsonm library.

type proof = Ezjsonm.value list

We will also need to compute cryptographic hashes of JSON values, which I'll do using the Cryptokit library. I'm using the SHA1 algorithm here, but obviously any secure hashing algorithm would work.

let hash_json =
  let hash_algo = Cryptokit.Hash.sha1 () in
  fun json_value ->
    Cryptokit.hash_string hash_algo (Ezjsonm.to_string (`A json_value))

Prover Implementation

The module Prover implements the AUTHENTIKIT signature, plus the additional knowledge that a) authenticated computations generate proofs as well as values, and b) that we have a get_hash function that returns the hashed representation of any authenticated value. The prover’s interface is captured by its module signature:

module Prover : sig
  include AUTHENTIKIT with type 'a authenticated_computation = proof * 'a
  val get_hash : 'a auth -> string
end = struct

First, we implement the prover’s view of authenticated values by defining its implementation of the auth type. The prover sees an authenticated value as a pair of an underlying value x and a hash of x’s representation:

  type 'a auth = 'a * string

  let get_hash (a,h) = h

It will be up to the rest of the functions in this module to maintain the invariant that the second half of the pair will always be the hash code of the serialised representation of the first half.

Authenticated computations on the prover side are represented using a specialisation of the standard Writer monad, which collects a proof (list of JSON values) as a side-effect. This implementation is quite slow, because I have used lists, but could be speeded up by using a better data structure.

  type 'a authenticated_computation =
    proof * 'a

  let return a =
    ([], a)

  let (>>=) (prf,a) f =
    let (prf',b) = f a in
    (prf @ prf',b)

The prover's view of authenticatable values are ones for which it is possible to serialise them to JSON. We represent the evidence that such a thing is possible as the existence of a function from values to JSON values.

  module Authenticatable = struct
    type 'a evidence = 'a -> Ezjsonm.value

    let auth (a,h) =
      `String h

    let pair a_serialiser b_serialiser (a,b) =
      `A [a_serialiser a; b_serialiser b]

    let sum a_serialiser b_serialiser = function
      | `left a  -> `A [`String "left"; a_serialiser a]
      | `right b -> `A [`String "right"; b_serialiser b]

    let string s = `String s
    let int i = `String (string_of_int i)
  end

In the auth case, we only serialise the hash code, not the underlying value. This ensures that the prover does not end up sending whole data structures back to the client. In the int case, I'm serialising OCaml’s int values as strings, to avoid complications arising from JSON’s use of floating point representations for numbers.

Now the auth and unauth functions. Creation of authenticated values means pairing the value with its hashed serialised representation:

  let auth serialiser a =
    (a, hash_json (serialiser a)))

Extracting the underlying value from an authenticated value has the “side effect” of producing a step in the proof, which is the JSON representation of the value we expect to see:

  let unauth serialiser (a, h) =
    ([serialiser a], a)

Finally, we complete the definition of Prover with an end:

end

Trying out the Prover

We get a prover-side Merkle tree implementation by instantiating the Merkle module with the Prover implementation of AUTHENTIKIT:

module Merkle_Prover = Merkle (Prover)

Let’s make a little prover-side tree, in the OCaml REPL:

# let tree =
    Merkle_Prover.(make_branch
                     (make_branch (make_leaf "a") (make_leaf "b"))
                     (make_branch (make_leaf "c") (make_leaf "d")));;
val tree : Merkle_Prover.tree = <abstr>

The returned value is abstract, because it is really a pair of the underlying data and it hash code. Using the prover’s get_hash function, we can get the hash code of the tree, which is what the verifier will use to authenticate the prover’s actions:

# let code = Prover.get_hash tree;;
val code : string = ".z\129w\199J\224\\\254\220\bo\246W\158\243S\029\177\190"

We can also ask the prover to do queries on the tree. For example, if we ask for the value at position “left, left”, it returns the result “a”, and a proof that the verifier will be able to use to check that this result actually came from the tree with the hash code above.

# let proof, result = Merkle_Prover.retrieve [`L;`L] tree;;
val proof : proof =
    [`A [ `String "right"
        ; `A [ `String "?\250m&,\251\129\031\r\252QJ\001\141|d}\242\016l"
             ; `String "i?B\230p\158D\201\248\145\000\1400p\224\018\023\219\1935"
             ]
        ]
    ; `A [ `String "right"
         ; `A [ `String "X\140\005\028\146\1891L\224\246\224\229\201\018o\b\187\163\240\160"
              ; `String "\223\231\194\230\1362=\157\187\226;?\143>\127\248\014;\201\254"
              ]
         ]
    ; `A [`String "left"; `String "a"]
    ]
val result : string option = Some "a"

This proof looks quite long, and in this case is several times the size of the original database. So it might look like Merkle trees might not save us anything. In general, however, the proof size is logarithmic in the size of the tree and so will be much smaller than sending the entire tree.

The Verifier

We’ve got a prover generating query responses and proofs, but this is a bit useless if we don’t have a verifier to check them.

Our verifier is another implementation of the AUTHENTIKIT signature, with the additional information that authenticated computations are now functions that consume proofs and return either a value (and possibly some left-over proof) or return failure. Also, we specify that, on the verifier side, an authenticated value is represented as just its hash code (represented as an OCaml string).

module Verifier : sig
  include AuthentiKit
    with type 'a authenticated_computation =
                   proof -> [ `Ok of proof * 'a | `ProofFailure ]
     and type 'a auth = string
end = struct

The basic idea behind the verifier is that, given a hash code and a proof, it checks the hash code against the proof, and then uses the hash code to rebuild the parts of the data structure that will be explored by the program.

We start by fulfilling our statement that the verifier’s view of authenticated values is as OCaml strings:

  type 'a auth = string

And that the verifier’s version of authenticated computations is as a “parser” of proofs. Note how the definitions here are very similar to the definitions used for parser combinators. In some sense, verification is a process of “parsing” the proof sequence supplied by the prover.

  type 'a authenticated_computation =
    json list -> [`Ok of proof * 'a | `ProofFailure]

  let return a =
    fun proof -> `Ok (proof, a)

  let (>>=) c f =
    fun prfs ->
      match c prfs with
        | `ProofFailure -> `ProofFailure
        | `Ok (prfs',a) -> f a prfs'

The verifier’s view of authenticable values is slightly more involved than the prover’s, because it needs to be able to serialise and deserialise values to and from JSON. There isn’t anything particularly special going on here, but we do have to be careful to ensure that this implementation uses the same format as the prover’s. (An obvious improvement is to make the Prover and Verifier implementations share an implementation of this sub-module.)

  module Authenticatable = struct
    type 'a evidence =
      { serialise   : 'a            -> Ezjsonm.value
      ; deserialise : Ezjsonm.value -> 'a option
      }

    let auth =
      let serialise h = `String h
      and deserialise = function
        | `String s -> Some s
        | _         -> None
      in
      { serialise; deserialise }

    let pair a_s b_s =
      let serialise (a,b) =
        `A [a_s.serialise a; b_s.serialise b]
      and deserialise = function
        | `A [x;y] ->
           (match a_s.deserialise x, b_s.deserialise y with
             | Some a, Some b -> Some (a,b)
             | _ -> None)
        | _ ->
           None
      in
      { serialise; deserialise }

    let sum a_s b_s =
      let serialise = function
        | `left a  -> `A [`String "left"; a_s.serialise a]
        | `right b -> `A [`String "right"; b_s.serialise b]
      and deserialise = function
        | `A [`String "left"; x] ->
           (match a_s.deserialise x with
             | Some a -> Some (`left a)
             | _ -> None)
        | `A [`String "right"; y] ->
           (match b_s.deserialise y with
             | Some b -> Some (`right b)
             | _ -> None)
        | _ ->
           None
      in
      { serialise; deserialise }

    let string =
      let serialise s = `String s
      and deserialise = function
        | `String s -> Some s
        | _         -> None
      in
      { serialise; deserialise }

    let int =
      let serialise i = `String (string_of_int i)
      and deserialise = function
        | `String i -> (try Some (int_of_string i) with Failure _ -> None)
        | _         -> None
      in
      { serialise; deserialise }
  end

Now we get to the crucial auth and unauth functions. Creation of authenticated values on the verifier side is nothing more than serialising them and computing their hash code:

  open Authenticatable

  let auth auth_evidence a =
    hash_json (auth_evidence.serialise a))

Finally, we get to the actual proof checking. The unauth function is supplied with a) a (de)serialiser for the type of value to be produced; b) a hash code for the expected value; and c) a proof which will be used to reconstitute the actual value. If we have run out of proof elements, then we fail immediately. Otherwise, we check that the hash code of the first item in the proof is the same as our required code. If so, we deserialise the proof item and return it with the remainder of the proof. If the hash codes do not match, then the verifier reports that proof checking has failed.

  let unauth auth_evidence h proof = match proof with
    | [] -> `ProofFailure
    | p::ps when hash_json p = h ->
       (match auth_evidence.deserialise p with
         | None   -> `ProofFailure
         | Some a -> `Ok (ps, a))
    | _ -> `ProofFailure

Finally, the Verifier ends with an end:

end

Trying out the Verifier

Above, we used the Merkle_Prover module to generate a small Merkle tree and run a query on it. We can now verify the prover’s execution of retrieve. Recall that we extracted a hash code representing the tree from the prover’s representation:

val code : string = ".z\129w\199J\224\\\254\220\bo\246W\158\243S\029\177\190"

Let’s assume that this code was conveyed to the client somehow, and the client trusts that it is an accurate representation of the tree it wants to query.

Now the client/verifier asks the server/prover to perform a query. The server sends back the result and a proof, which we computed above:

val proof : proof =
    [`A [ `String "right"
        ; `A [ `String "?\250m&,\251\129\031\r\252QJ\001\141|d}\242\016l"
             ; `String "i?B\230p\158D\201\248\145\000\1400p\224\018\023\219\1935"
             ]
        ]
    ; `A [ `String "right"
         ; `A [ `String "X\140\005\028\146\1891L\224\246\224\229\201\018o\b\187\163\240\160"
              ; `String "\223\231\194\230\1362=\157\187\226;?\143>\127\248\014;\201\254"
              ]
         ]
    ; `A [`String "left"; `String "a"]
    ]
val result : string option = Some "a"

We can now use Merkle_Verifier to verify the prover’s proof against code:

# Merkle_Verifier.retrieve [`L;`L] code proof;;
- : [ `Ok of proof * string option | `ProofFailure ] = `Ok ([], Some "a")

Attempting to trick the verifier

Let’s try simulating an attempt by the prover to trick the verifier by running the query against a different tree. First we create a tree with the same shape, but with different values in it:

let other_tree =
    Merkle_Prover.(make_branch
                     (make_branch (make_leaf "A") (make_leaf "B"))
                     (make_branch (make_leaf "C") (make_leaf "D")));;

Now we run the “left, left” query on this alternative tree, and we get back a result and a proof:

# let proof, result = Merkle_Prover.retrieve [`L;`L] other_tree;;
val proof : proof =
   [ `A [`String "right"
        ; `A [`String "Q\217G\000\246\238!\248\212\127\194\184\179>\017zW\0182("
             ; `String "0\239\238\002b4\172\145\127\143\002@-=g\179\197\022\154|"
             ]
        ]
   ; `A [`String "right"
        ; `A [`String "\157\134\234N\234QS\136\165\196\0038\133j\018uY\133\030\005"
             ; `String "\2272|\223M\230\241\132\167\029-\016\141\221\005nx\239\190\184"
             ]
        ]
   ; `A [`String "left"; `String "A"]
   ]
val result : bytes option = Some "A"

We send the result and proof back to the sceptical client, who verifies the proof against their hash code:

# Merkle_Verifier.retrieve [`L;`L] code proof;;
- : [ `Ok of proof * bytes option | `ProofFailure ] = `ProofFailure

It fails! The cheating server has been thwarted.

Correctness from Parametricity

Of course, the examples above don’t prove that the server/prover can never cheat. For their calculus, Miller et al. provide a proof that the prover and verifier implementations satisfy the following property, which I’ve stated informally here:

If the verifier accepts a proof (i.e., returns Ok) then either it came from the prover, or is the result of a hash collision.

If we assume that hash collisions are unlikely (Miller et al. make this assumption more formal), then we get the security property we want: up to the possibility of hash collision, we can spot cheating clouds.

I claim, but I’m not going to prove here, that this property is provable as an instance of the “Free Theorem” we get from the fact that the Merkle implementation is parameterised by AUTHENTIKIT implementations. The full proof involves F-ing Modules to translate the use of modules and functors into the higher-kinded calculus System Fω, my Relational Parametricity for Higher Kinds for the higher-kinded version of free theorems, and Katsumata’s A Semantic Formulation of TT-lifting and Logical Predicates for Computational Metalanguage to handle the authenticated computations monads.