References: <20200530055953.817666-1-krisman@collabora.com>
In-Reply-To: <20200530055953.817666-1-krisman@collabora.com>
From: Sargun Dhillon
Date: Thu, 4 Jun 2020 23:06:09 -0700
Subject: Re: [PATCH RFC] seccomp: Implement syscall
 isolation based on memory areas
To: Gabriel Krisman Bertazi
Cc: linux-mm@kvack.org, LKML, kernel@collabora.com, Thomas Gleixner,
 Kees Cook, Andy Lutomirski, Will Drewry, "H . Peter Anvin", Paul Gofman

On Fri, May 29, 2020 at 11:01 PM Gabriel Krisman Bertazi wrote:
>
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
>
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux. Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
>
> The filtering cannot really be done based solely on the syscall number,
> because those could collide with existing Linux syscalls. Instead, our
> proposed solution is to trap syscalls based on the userspace memory
> region that triggered the syscall, as wine is responsible for the
> Windows code allocations and it can apply correct memory protections to
> those areas.
>
> Therefore, this patch reuses the seccomp infrastructure to trap
> system calls, but introduces a new mode to trap based on a vma attribute
> that describes whether the userspace memory region is allowed to execute
> syscalls or not. The protection is defined at mmap/mprotect time with a
> new protection flag PROT_NOSYSCALL.
> This setting only takes effect if
> the new SECCOMP_MODE_MEMMAP is enabled through seccomp().
>
> It goes without saying that this is in no way a security mechanism
> despite being built on top of seccomp, since an evil application can
> always jump to a whitelisted memory region and run the syscall. This
> is not a concern for Wine games. Nevertheless, we reuse seccomp as a
> way to avoid adding a new mechanism to essentially do the same job of
> filtering system calls.
>
> * Why not SECCOMP_MODE_FILTER?
>
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
>
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows. Filter would need a way to be updated
> frequently during game execution.
>
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution. Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
>
> 3. As I mentioned before, performance is critical. In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions. In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.
> Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
>
> Sending as an RFC for now to get the discussion started. In particular:

I have a totally different question. I am experimenting with a patchset
which is designed to help with the "extended syscall" case (as Kees calls
it). Effectively, syscalls like openat2, where the syscall arguments are
passed as a (potentially mixed-size) structure, need to be inspectable
through user notif.

We can kind of deal with this for other syscalls with mechanisms like
pidfd_getfd, addfd, and potentially being able to (re)set the registers
prior to the actual invocation of the syscall. Unfortunately, you cannot
do the same trick with user memory, because it opens you up to a
time-of-check, time-of-use attack: the kernel copies the syscall
arguments from the invoking program again after the manager has
inspected them.

One of the things I've been experimenting with is using tricks like
userfaultfd / mprotect to try to deal with this. I think that I might
have to add some capability to the kernel to actually deal with this. In
general, the approach is:

1. The syscall is invoked, and wakes up the manager.
2. The manager gets the arguments, and a handle (either the ID, or an
   FD). It then uses this handle to read memory, via something like
   process_vm_readv, an ioctl, or read.
3. When the kernel reads these arguments, it splits the VMA for the
   address the pointer lies in, and sets up access() with a special
   mapping that checks whether the page has been tampered with by
   userspace, in the read ranges, between the manager's read and the
   writes. We can either SIGBUS or stall writes to the range if we want
   to make things "simple", or we can mess with uaccess bits and EPERM
   if the kernel tries to read that memory.
4. When the syscall returns, or the kernel writes to that area, we reset
   the mapping.

I'm wondering if you're dynamically generating these special mappings with protection, and how many of them you're generating. How often are you generating them? What kind of performance cost do you see in normal programs?