==============
Security ideas
==============

These are some notes I (Armin) took after a talk at Chalmers by Steve
Zdancewic: "Encoding Information Flow in Haskell".  That talk was
presenting a pure Haskell approach with monad-like constructions; I
think that the approach translates well to PyPy at the level of RPython.


The problem
-----------

The problem that we try to solve here is: how to give the programmer a
way to write programs that are easily checked to be "secure", in the
sense that bugs shouldn't allow confidential information to be
unexpectedly leaked.  This is not security as in defeating actively
malicious attackers.


Example
-------

Let's suppose that we want to write a telnet-based application for a
bidding system.  We want normal users to be able to log in with their
username and password, and place bids (i.e. type in an amount of money).
The server should record the highest bid so far but not allow users to
see that number.  Additionally, the administrator should be able to log
in with his own password and see the highest bid.  The basic program::

    def mainloop():
        while True:
            username = raw_input()
            password = raw_input()
            user = authenticate(username, password)
            if user == 'guest':
                serve_guest()
            elif user == 'admin':
                serve_admin()

    def serve_guest():
        global highest_bid
        print "Enter your bid:"
        n = int(raw_input())
        if n > highest_bid:     #
            highest_bid = n     #
        print "Thank you"

    def serve_admin():
        print "Highest bid is:", highest_bid

The goal is to make this program more secure by declaring and enforcing
the following properties: first, the guest code is allowed to manipulate
the highest_bid, as in the lines marked with ``#``, but these lines must
not leak back the highest_bid in a form visible to the guest user;
second, the printing in serve_admin() must only be allowed if the user
that logged in is really the administrator (e.g. catch bugs like
accidentally swapping the serve_guest() and serve_admin() calls in
mainloop()).


Preventing leak of information in guest code: 1st try
-----------------------------------------------------

The basic technique to prevent leaks is to attach "confidentiality
level" tags to objects.  In this example, the highest_bid int object
would be tagged with label="secret", e.g. by being initialized as::

    highest_bid = tag(0, label="secret")

At first, we can think about an object space where all objects have such
a label, and the label propagates to operations between objects: for
example, code like ``highest_bid += 1`` would produce a new int object
with again label="secret".
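
As a rough illustration, label propagation can be sketched in plain Python. The ``Tagged`` class and the ``LEVELS`` ordering are names made up for this sketch, not PyPy APIs, and only two levels are assumed:

```python
# Assumed ordering: "secret" is more confidential than "public".
LEVELS = {"public": 0, "secret": 1}

class Tagged(object):
    """A value paired with a confidentiality label (illustrative only)."""
    def __init__(self, value, label="public"):
        self.value = value
        self.label = label

    def _combine(self, other, op):
        if not isinstance(other, Tagged):
            other = Tagged(other)          # plain constants count as public
        # the result is as confidential as the most confidential operand
        label = max(self.label, other.label, key=LEVELS.get)
        return Tagged(op(self.value, other.value), label)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __gt__(self, other):
        return self._combine(other, lambda a, b: a > b)

def tag(value, label):
    return Tagged(value, label)

highest_bid = tag(0, label="secret")
highest_bid += 1                              # new object, again "secret"
is_higher = tag(5, "public") > highest_bid    # a "secret" bool
```

Every operation returns a fresh ``Tagged`` object, so the label follows the data through arithmetic and comparisons.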

Where this approach doesn't work is with if/else or loops.  In the above
example, we do::

        if n > highest_bid:
            ...

However, by the object space rules introduced above, the result of the
comparison is a "secret" bool object.  This means that the guest code
cannot know if it is True or False, and so the PyPy interpreter has no
clue whether it must follow the ``then`` or ``else`` branch of the ``if``.
So the guest code could do ``highest_bid += 1`` and probably even
``highest_bid = max(highest_bid, n)`` if max() is a clever enough
built-in function, but clearly this approach doesn't work well for more
complicated computations that we would like to perform at this point.

There might be very cool ways to solve this by doing some kind of
just-in-time flow object space analysis.  However, here is a
possibly more practical approach.  Let's forget about the object space
tricks and start again.  (See `Related work`_ for why the object space
approach doesn't work too well.)


Preventing leak of information in guest code with the annotator instead
-----------------------------------------------------------------------

Suppose that the program runs on top of CPython and not necessarily
PyPy.  We will only need PyPy's annotator.  The idea is to mark the code
that manipulates highest_bid explicitly, and make it RPython in the
sense that we can take its flow space and follow the calls (we don't
care about the precise types here -- we will use different annotations).
Note that only the bits that manipulate the secret values need to be
RPython.  Example::

    # on top of CPython, 'hidden' is a type that hides a value without
    # giving any way to normal programs to access it, so the program
    # cannot do anything with 'highest_bid'

    highest_bid = hidden(0, label="secret")

    def enter_bid(n):
        if n > highest_bid.value:
            highest_bid.value = n

    enter_bid = secure(enter_bid)

    def serve_guest():
        print "Enter your bid:"
        n = int(raw_input())
        enter_bid(n)
        print "Thank you"

The point is that the expression ``highest_bid.value`` raises a
SecurityException when run normally: it is not allowed to read this
value.  The secure() decorator uses the annotator on the enter_bid()
function, with special annotations that I will describe shortly.  Then
secure() returns a "compiled" version of enter_bid.  The compiled
version is checked to satisfy the security constraints, and it contains
special code that then enables the ``highest_bid.value`` to work.
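
The runtime half of this behaviour can be mimicked with a toy. In the sketch below, ``SecurityException``, the ``_unlocked`` switch and this ``secure()`` are stand-ins invented for illustration: no annotator is involved, so nothing is actually checked, and plain Python could still reach ``_value`` directly. It only shows the "refused when run normally, allowed once compiled" behaviour:

```python
class SecurityException(Exception):
    pass

class hidden(object):
    """Holds a value that ordinary code is not supposed to read or write."""
    _unlocked = False          # hypothetical switch, set only by secure()

    def __init__(self, value, label):
        self._value = value    # a real implementation must hide this better
        self.label = label

    @property
    def value(self):
        if not hidden._unlocked:
            raise SecurityException("reading a hidden value")
        return self._value

    @value.setter
    def value(self, new_value):
        if not hidden._unlocked:
            raise SecurityException("writing a hidden value")
        self._value = new_value

def secure(func):
    """Stand-in for the real secure(): skips the static check entirely."""
    def compiled(*args):
        hidden._unlocked = True
        try:
            return func(*args)
        finally:
            hidden._unlocked = False
    return compiled

highest_bid = hidden(0, label="secret")

def enter_bid(n):
    if n > highest_bid.value:
        highest_bid.value = n

try:
    enter_bid(10)              # run normally: access is refused
    refused = False
except SecurityException:
    refused = True

enter_bid = secure(enter_bid)
enter_bid(10)                  # "compiled" version: access is enabled
```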

The annotations propagated by secure() are ``SomeSecurityLevel``
annotations.  Normal constants are propagated as
SomeSecurityLevel("public").  The ``highest_bid.value`` returns the
annotation SomeSecurityLevel("secret"), which is the label of the
constant ``highest_bid`` hidden object.  We define operations between
two SomeSecurityLevels to return a SomeSecurityLevel which is the
maximum of the confidentiality levels of the operands.

The key point is that secure() checks that the return value is
SomeSecurityLevel("public").  It also checks that only
SomeSecurityLevel("public") values are stored e.g. in global data
structures.
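
A toy version of these annotations, to make the rules concrete. This is not the annotator's real class hierarchy: the ``ORDER`` list, ``union()`` and ``is_public()`` are assumptions of the sketch, and the annotations of enter_bid() are worked out by hand instead of being derived from its flow graph:

```python
class SomeSecurityLevel(object):
    """Toy annotation that carries only a confidentiality level."""
    ORDER = ["public", "secret"]       # assumed total order, lowest first

    def __init__(self, level):
        assert level in self.ORDER
        self.level = level

    def union(self, other):
        # an operation on two values yields the max of their levels
        index = max(self.ORDER.index(self.level),
                    self.ORDER.index(other.level))
        return SomeSecurityLevel(self.ORDER[index])

def is_public(annotation):
    """The boundary check: only public values may leave the function."""
    return annotation.level == "public"

# enter_bid(n), annotated by hand:
s_n = SomeSecurityLevel("public")          # the argument comes from outside
s_bid = SomeSecurityLevel("secret")        # highest_bid.value
s_cond = s_n.union(s_bid)                  # n > highest_bid.value
s_result = SomeSecurityLevel("public")     # the function returns None

accepted = is_public(s_result)             # enter_bid() passes the check
leaky_accepted = is_public(s_cond)         # "return n > highest_bid.value"
                                           # would be rejected
```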

In this way, any CPython code like serve_guest() can safely call
``enter_bid(n)``.  There is no way to leak information about the current
highest bid back out of the compiled enter_bid().


Declassification
----------------

Now there must be a controlled way to leak the highest_bid value,
otherwise it is impossible even for the admin to read it.  Note that
serve_admin(), which prints highest_bid, is considered to "leak" this
value because printing is an input/output operation, i.e. the value
escapes the program.  This is a leak that we actually want -- the
terminology is that serve_admin() must "declassify" the value.

To do this, there is a capability-like model that is easy to implement
for us.  Let us modify the main loop as follows::

    def mainloop():
        while True:
            username = raw_input()
            password = raw_input()
            user, priviledge_token = authenticate(username, password)
            if user == 'guest':
                serve_guest()
            elif user == 'admin':
                serve_admin(priviledge_token)
            del priviledge_token   # make sure nobody else uses it

The idea is that the authenticate() function (shown later) also returns
a "token" object.  This is a normal Python object, but it should not be
possible for normal Python code to instantiate such an object manually.
In this example, authenticate() returns a ``priviledge("public")`` for
guests, and a ``priviledge("secret")`` for admins.  Now -- and this is
the insecure part of this scheme, but it is relatively easy to control
-- the programmer must make sure that these priviledge_token objects
don't go to unexpected places, particularly the "secret" one.  They work
like capabilities: having a reference to them allows parts of the
program to see secret information, of a confidentiality level up to the
one corresponding to the token.

Now we modify serve_admin() as follows::

    def serve_admin(token):
        print "Highest bid is:", declassify(highest_bid, token=token)

The declassify() function reads the value if the "token" is privileged
enough, and raises an exception otherwise.
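
A minimal sketch of that check, assuming a simple numeric ordering of the levels. The ``LEVELS`` table, ``SecurityException`` and the bare-bones ``hidden`` and ``priviledge`` classes are placeholders for this example. In particular, nothing here stops plain Python code from instantiating ``priviledge`` itself, which the real scheme would have to prevent:

```python
LEVELS = {"public": 0, "secret": 1}    # assumed ordering of the levels

class SecurityException(Exception):
    pass

class priviledge(object):
    """Capability token: holding it grants access up to `level`."""
    def __init__(self, level):
        self.level = level

class hidden(object):
    def __init__(self, value, label):
        self._value = value
        self.label = label

def declassify(obj, token):
    # the token must be at least as privileged as the object's label
    if LEVELS[token.level] < LEVELS[obj.label]:
        raise SecurityException("token is not privileged enough")
    return obj._value

highest_bid = hidden(42, label="secret")

admin_view = declassify(highest_bid, token=priviledge("secret"))

try:
    declassify(highest_bid, token=priviledge("public"))
    guest_refused = False
except SecurityException:
    guest_refused = True
```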

What are we protecting here?  The fact that we need the administrator
token in order to see the highest bid.  If by mistake we swap the
serve_guest() and serve_admin() lines in mainloop(), then what occurs is
that serve_admin() would be called with the guest token.  Then
declassify() would fail.  If we assume that authenticate() is not buggy,
then the rest of the program is safe from leak bugs.

There are other variants of declassify() that are convenient.  For
example, in the RPython parts of the code, declassify() can be used to
control more precisely at which confidentiality levels we want which
values, if there are more than just two such levels.  The "token"
argument could also be implicit in RPython parts, meaning "use the
current level"; normal non-RPython code always runs at "public" level,
but RPython functions could run with higher current levels, e.g. if they
are called with a "token=..." argument.

(Do not confuse this with what enter_bid() does: enter_bid() runs at the
public level all along.  It is ok for it to compute with, and even
modify, the highest_bid.value.  The point of enter_bid() was that by
being an RPython function the annotator can make sure that the value, or
even anything that gives a hint about the value, cannot possibly escape
from the function.)

It is also useful to have "globally trusted" administrator-level RPython
functions that always run at a higher level than the caller, a bit like
Unix programs with the "suid" bit.  If we set aside the consideration
that it should not be possible to make new "suid" functions too easily,
then we could define the authenticate() function of our server example
as follows::

    def authenticate(username, password):
        database = {('guest', 'abc'): priviledge("public"),
                    ('admin', '123'): priviledge("secret")}
        token_obj = database[username, password]
        return username, declassify(token_obj, target_level="public")

    authenticate = secure(authenticate, suid="secret")

The "suid" argument makes the compiled function run at level "secret"
even if the caller is "public" or plain CPython code.  The declassify()
in the function is allowed because of the current level of "secret".
Note that the function returns a "public" tuple -- the username is
public, and the token_obj is declassified to public.  This is the
property that allows CPython code to call it.
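
The level switch alone can be pictured as follows. This sketch models only the notion of a "current level"; the stack, ``current_level()`` and this ``secure()`` wrapper are invented for the illustration, and the static check that the result is public is left out:

```python
_level_stack = ["public"]        # plain CPython code runs at "public"

def current_level():
    return _level_stack[-1]

def secure(func, suid=None):
    """Stand-in wrapper: models the level switch, not the static check."""
    def compiled(*args):
        # without suid, the function keeps the level of its caller
        _level_stack.append(suid or current_level())
        try:
            return func(*args)
        finally:
            _level_stack.pop()
    return compiled

def report_level():
    return current_level()

plain = secure(report_level)
trusted = secure(report_level, suid="secret")

level_in_plain = plain()         # inherits "public" from the caller
level_in_trusted = trusted()     # raised to "secret" by suid
level_afterwards = current_level()
```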

Of course, like a Unix suid program the authenticate() function could be
buggy and leak information, but like suid programs it is small enough
for us to feel that it is secure just by staring at the code.

An alternative to the suid approach is to play with closures, e.g.::

    def setup():
        # initialize new levels -- this cannot be used to access existing levels
        public_level = create_new_priviledge("public")
        secret_level = create_new_priviledge("secret")

        database = {('guest', 'abc'): public_level,
                    ('admin', '123'): secret_level}

        def authenticate(username, password):
            token_obj = database[username, password]
            return username, declassify(token_obj, target_level="public",
                                                   token=secret_level)

        return secure(authenticate)

    authenticate = setup()

In this approach, declassify() works because it has access to the
secret_level token.  We still need to make authenticate() a secure()
compiled function to hide the database and the secret_level more
carefully; otherwise, code could accidentally find them by inspecting
the traceback of the KeyError exception if the username or password is
invalid.  Also, secure() will check for us that authenticate() indeed
returns a "public" tuple.
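
To see the traceback leak concretely, here is what happens when authenticate() is left as a plain Python closure. The string tokens are placeholders, and the function is deliberately not wrapped in secure():

```python
import sys

def setup():
    database = {('guest', 'abc'): "public-token",     # placeholder tokens
                ('admin', '123'): "secret-token"}

    def authenticate(username, password):
        return username, database[username, password]

    return authenticate          # note: NOT wrapped in secure()

authenticate = setup()

leaked = None
try:
    authenticate('guest', 'wrong password')
except KeyError:
    tb = sys.exc_info()[2]
    while tb.tb_next is not None:      # walk down to the frame that raised
        tb = tb.tb_next
    # the frame's locals include the closure variable 'database'
    leaked = dict(tb.tb_frame.f_locals).get('database')
```

After the failed login, ``leaked`` is the whole credentials table, including the admin entry.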

This basic model is easy to extend in various directions.  For example,
secure() RPython functions should be allowed to return non-public
results -- but then they have to be called either with an appropriate
"token=..." keyword, or else they return hidden objects again.  They
could also be used directly from other RPython functions, in which case
the level of what they return is propagated.


Related work
------------

What I'm describing here is nothing more than an adaptation of existing
techniques to RPython.

It is worth mentioning at this point why the object space approach
doesn't work as well as we could first expect.  The distinction between
static checking and dynamic checking (with labels only attached to
values) seems to be well known; also, it seems to be well known that the
latter is too coarse in practice.  The problem is about branching and
looping.  From the object space's point of view it is quite hard to know
what a newly computed value really depends on.  Basically, it is
difficult to do better than: after is_true() has been called on a secret
object, then we must assume that all objects created are also secret
because they could depend in some way on the truth-value of the previous
secret object.
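
This coarse rule can be sketched with a toy "space" that remembers the level of everything it has branched on so far. ``LabelSpace`` and its methods are hypothetical, loosely named after object space operations, and wrapped objects are plain ``(value, label)`` tuples:

```python
LEVELS = {"public": 0, "secret": 1}    # assumed ordering of the levels

def join(*labels):
    return max(labels, key=LEVELS.get)

class LabelSpace(object):
    """Toy stand-in for an object space tracking labels dynamically."""
    def __init__(self):
        self.context = "public"    # level of all conditions tested so far

    def wrap(self, value, label="public"):
        # every new object inherits the level of the current context
        return (value, join(label, self.context))

    def gt(self, w_a, w_b):
        return self.wrap(w_a[0] > w_b[0], join(w_a[1], w_b[1]))

    def is_true(self, w_obj):
        # from now on, control flow depends on w_obj
        self.context = join(self.context, w_obj[1])
        return bool(w_obj[0])

space = LabelSpace()
w_highest_bid = space.wrap(100, "secret")
w_n = space.wrap(50)

before = space.wrap("Thank you")       # created before the branch
if space.is_true(space.gt(w_n, w_highest_bid)):
    w_highest_bid = w_n
after = space.wrap("Thank you")        # created after the branch
```

The two strings are identical, yet the second one is labelled "secret" because it was created after a branch on a secret bool. This is the coarseness described above: the context level never goes back down.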

The idea to dynamically use static analysis is the key new idea
presented by Steve Zdancewic in his talk.  You can have small controlled
RPython parts of the program that must pass through a static analysis,
and we only need to check dynamically that some input conditions are
satisfied when other parts of the program call the RPython parts.
Previous research was mostly about designing languages that are
completely statically checked at compile-time.  The delicate part is to
get the static/dynamic mixture right so that even indirect leaks are not
possible -- e.g. leaks that would occur from calling functions with
strange arguments to provoke exceptions, and where the presence of the
exception or not would be information in itself.  This approach seems to
do that reliably.  (Of course, at the talk many people including the
speaker were wondering about ways to move more of the checking to
compile-time, but Python people won't have such worries :-)
