From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20200530055953.817666-1-krisman@collabora.com>
In-Reply-To: <20200530055953.817666-1-krisman@collabora.com>
From: Sargun Dhillon <sargun@sargun.me>
Date: Thu, 4 Jun 2020 23:06:09 -0700
Message-ID: <CAMp4zn--RbHeViLOmRi4USE7hwTNhVqASJJJeXjCkOah5R4-0A@mail.gmail.com>
Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
To: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>, kernel@collabora.com, 
	Thomas Gleixner <tglx@linutronix.de>, Kees Cook <keescook@chromium.org>, 
	Andy Lutomirski <luto@amacapital.net>, Will Drewry <wad@chromium.org>, 
	"H . Peter Anvin" <hpa@zytor.com>, Paul Gofman <gofmanp@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, May 29, 2020 at 11:01 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
>
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
>
> The filtering cannot really be done based solely on the syscall number,
> because those could collide with existing Linux syscalls.  Instead, our
> proposed solution is to trap syscalls based on the userspace memory
> region that triggered the syscall, as wine is responsible for the
> Windows code allocations and it can apply correct memory protections to
> those areas.
>
> Therefore, this patch reuses the seccomp infrastructure to trap
> system calls, but introduces a new mode to trap based on a vma attribute
> that describes whether the userspace memory region is allowed to execute
> syscalls or not.  The protection is defined at mmap/mprotect time with a
> new protection flag PROT_NOSYSCALL.  This setting only takes effect if
> the new SECCOMP_MODE_MEMMAP is enabled through seccomp().
>
> It goes without saying that this is in no way a security mechanism
> despite being built on top of seccomp, since an evil application can
> always jump to a whitelisted memory region and run the syscall.  This
> is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
> way to avoid adding a new mechanism to essentially do the same job of
> filtering system calls.
>
> * Why not SECCOMP_MODE_FILTER?
>
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
>
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows.  Filter would need a way to be updated
> frequently during game execution.
>
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution.  Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
>
> 3. As I mentioned before, performance is critical.  In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions.  In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.  Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
>
> Sending as an RFC for now to get the discussion started.  In particular:

I have a totally different question. I am experimenting with a patchset
designed to help with the "extended syscall" case (as Kees calls it).
Effectively, syscalls like openat2, where the arguments are passed as a
(potentially mixed-size) structure, need to be inspectable through user
notif. We can kind of deal with this for other syscalls with mechanisms
like pidfd_getfd, addfd, and potentially being able to (re)set the
registers prior to the actual invocation of the syscall. Unfortunately,
you cannot do the same trick with user memory, because it opens you up
to a time-of-check, time-of-use attack, since the kernel copies the
syscall arguments from the invoking program again.
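
To make the race concrete, here is a minimal single-process sketch in
Python. It is only a simulation, not kernel code: the bytearray stands
in for the invoking program's memory, the three 64-bit fields follow the
struct open_how layout of openat2(2) (little-endian assumed), and the
names supervisor_allows, kernel_copy_from_user and run_race, as well as
the "read-only opens only" policy, are made up for illustration.

```python
import struct

# struct open_how from openat2(2): flags, mode, resolve (three u64 fields).
OPEN_HOW = struct.Struct("<QQQ")

O_RDONLY = 0
O_WRONLY = 1
O_ACCMODE = 3


def supervisor_allows(user_mem: bytearray) -> bool:
    """Hypothetical policy: the manager only permits read-only opens."""
    flags, _mode, _resolve = OPEN_HOW.unpack_from(user_mem, 0)
    return (flags & O_ACCMODE) == O_RDONLY


def kernel_copy_from_user(user_mem: bytearray) -> tuple:
    """Stand-in for the kernel's own, later copy of the argument struct."""
    return OPEN_HOW.unpack_from(user_mem, 0)


def run_race() -> tuple:
    # The invoking program places a harmless-looking struct in its memory.
    user_mem = bytearray(OPEN_HOW.pack(O_RDONLY, 0, 0))

    # Time of check: the manager inspects the struct and approves it.
    allowed = supervisor_allows(user_mem)

    # Another thread of the invoking program rewrites the struct in place.
    OPEN_HOW.pack_into(user_mem, 0, O_WRONLY, 0, 0)

    # Time of use: the kernel copies the arguments again and sees new flags.
    flags, _mode, _resolve = kernel_copy_from_user(user_mem)
    return allowed, flags
```

run_race() returns an approval together with flags the manager never
saw, which is the gap the rest of this mail is trying to close.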

One of the things I've been experimenting with is using tricks like
userfaultfd / mprotect to try to deal with this. I think that I might
have to add some capability to the kernel to actually deal with it. In
general, the approach is:

1. The syscall is invoked, and wakes up the manager.
2. The manager gets the arguments and a handle (either the ID or an
   FD). It then uses this handle to read memory, through something like
   process_vm_readv, an ioctl, or read.
3. When the kernel reads these arguments, it splits the VMA for the
   address the pointer lies in, and sets up access() with a special
   mapping that checks whether userspace has tampered with the page, in
   the read ranges, between the manager's read and the writes. We can
   either SIGBUS or stall writes to the range if we want to make things
   "simple", or we can mess with uaccess bits and EPERM if the kernel
   tries to read that memory.
4. When the syscall returns, or the kernel writes to that area, we
   reset the mapping.
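
As a rough illustration of steps 2 through 4, the following Python toy
models only the "simple" variant, where a write to a range the manager
has already read is refused. Nothing here corresponds to an existing
kernel interface: GuardedMemory and its methods are hypothetical names,
PermissionError stands in for SIGBUS, and the VMA splitting is reduced
to a list of byte ranges.

```python
class GuardedMemory:
    """Toy model of an address space with guards on manager-read ranges."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.guards = []  # (start, end) ranges the manager has read

    def manager_read(self, start: int, length: int) -> bytes:
        # Steps 2 and 3: reading on behalf of the manager guards the range.
        self.guards.append((start, start + length))
        return bytes(self.data[start:start + length])

    def tracee_write(self, start: int, payload: bytes) -> None:
        # Step 3: a userspace write overlapping a guarded range is refused.
        end = start + len(payload)
        for g_start, g_end in self.guards:
            if start < g_end and g_start < end:
                raise PermissionError("write to guarded range")
        self.data[start:end] = payload

    def syscall_return(self) -> None:
        # Step 4: once the syscall returns, the mapping is reset.
        self.guards.clear()


def demo() -> tuple:
    mem = GuardedMemory(64)
    mem.tracee_write(0, b"/tmp/ok\0")      # before the syscall: allowed
    seen = mem.manager_read(0, 8)          # manager inspects the argument
    try:
        mem.tracee_write(0, b"/etc/pw\0")  # racing rewrite
        blocked = False
    except PermissionError:
        blocked = True
    unchanged = bytes(mem.data[0:8]) == seen
    mem.syscall_return()                   # guard dropped
    mem.tracee_write(0, b"/tmp/new")       # writes work again
    return blocked, unchanged, bytes(mem.data[0:8])
```

In this model the racing rewrite is rejected, the bytes the manager
inspected are still the bytes in memory at the time of use, and ordinary
writes resume after the syscall returns.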

I'm wondering if you're dynamically generating these special mappings
with protection, and how many of them you're generating. How often are
you generating them? What kind of performance cost do you see in normal
programs?