From: Andy Lutomirski
Date: Sun, 31 May 2020 11:57:02 -0700
Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
To: Paul Gofman
Cc: Andy Lutomirski, Gabriel Krisman Bertazi, Linux-MM, LKML,
 kernel@collabora.com, Thomas Gleixner, Kees Cook, Will Drewry,
 "H. Peter Anvin", Zebediah Figura
References: <85367hkl06.fsf@collabora.com>
 <079539BF-F301-47BA-AEAD-AED23275FEA1@amacapital.net>
 <50a9e680-6be1-ff50-5c82-1bf54c7484a9@gmail.com>

On Sun, May 31, 2020 at 11:36 AM Paul Gofman wrote:
>
> On 5/31/20 21:10, Andy Lutomirski wrote:
> >
> > That's not what I meant. I meant that you would set the kernel up to
> > redirect *all* syscalls from the thread with the sole exception of one
> > syscall instruction in the thunk. This would catch Windows syscalls
> > and Linux syscalls. The thunk would determine whether the original
> > syscall was Linux or Windows and handle it accordingly.
> >
> > This may interact poorly with the DRM scheme. The redzone might need
> > to be respected, or stack switching might be needed.
>
> Oh yeah, I see now, thanks. Sure, we could trap every syscall and have a
> Seccomp-allowed trampoline for executing native ones with the existing
> Seccomp implementation. But this is going to have prohibitive
> performance impact. Our present use case specifics is that vast majority
> of syscalls do not need to be emulated, they are native. And just a few
> go from the Windows application which we need to trap and route to our
> handler to let the program continue, while we do not care too much about
> the overhead for those few. So the hope was that the kernel can route
> that majority of Linux native syscalls inside with the minor overhead.
> I've read the suggestion to use SECCOMP_RET_USER_NOTIF instead of
> SECCOMP_RET_TRAP, is handling the trap this way supposed to be much
> quicker than handling the sigsys from SECCOMP_RET_TRAP? More
> specifically, would not SECCOMP_RET_USER_NOTIF effectively serialize all
> the syscalls waiting in a single queue for processing, while
> SECCOMP_RET_TRAP can be processed without exclusive locking?

Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
expensive than my scheme. On a non-PTI system, my approach will add a
few tens of ns to each syscall. On a PTI system, it will be worse.
But using any kind of notifier for all syscalls will cause a context
switch to a different user program for each syscall, and that will be
much slower.

I think that the implementation may well want to live in seccomp, but
doing this as a seccomp filter isn't quite right. It's not a security
thing -- it's an emulation thing. Seccomp is all about making
inescapable sandboxes, but that's not what you're doing at all, and
the fact that seccomp filters are preserved across execve() sounds
like it'll be annoying for you.

What if there was a special filter type that ran a BPF program on each
syscall, and the program was allowed to access user memory to make its
decisions, e.g. to look at some list of memory addresses? But this
would explicitly *not* be a security feature -- execve() would remove
the filter, and the filter's outcome would be one of redirecting
execution or allowing the syscall. If the "allow" outcome occurs,
then regular seccomp filters run. Obviously the exact semantics here
would need some care.
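
To make the proposed ordering concrete, here is a rough user-space model of it in Python. It is only a sketch of the semantics described above, not kernel code and not an existing interface: the names `Task`, `emulation_filter`, `dispatch` and `execve` are invented for illustration. It also assumes that the "list of memory addresses" is a set of address ranges from which syscalls are treated as native, and it models regular seccomp filters as simple predicates on the syscall number.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Outcome(Enum):
    REDIRECT = "redirect"  # hand control to the user-space emulation handler
    ALLOW = "allow"        # let the syscall run as a native Linux syscall
    DENY = "deny"          # only a regular seccomp filter can produce this


@dataclass
class Task:
    # Assumption: the list kept in user memory holds half-open
    # [start, end) ranges whose syscall instructions are native.
    native_ranges: List[Tuple[int, int]] = field(default_factory=list)
    emulation_filter_installed: bool = False
    # Regular seccomp filters, modelled as "is this syscall number OK?".
    seccomp_filters: List[Callable[[int], bool]] = field(default_factory=list)


def emulation_filter(task: Task, ip: int) -> Outcome:
    """Decide from the syscall instruction address alone."""
    for start, end in task.native_ranges:
        if start <= ip < end:
            return Outcome.ALLOW
    return Outcome.REDIRECT


def dispatch(task: Task, ip: int, nr: int) -> Outcome:
    """The emulation filter runs first; seccomp only sees allowed syscalls."""
    if task.emulation_filter_installed:
        if emulation_filter(task, ip) is Outcome.REDIRECT:
            return Outcome.REDIRECT
    # "allow" outcome (or no emulation filter): regular seccomp filters run.
    if all(allowed(nr) for allowed in task.seccomp_filters):
        return Outcome.ALLOW
    return Outcome.DENY


def execve(task: Task) -> None:
    """execve() drops the emulation filter but preserves seccomp filters."""
    task.emulation_filter_installed = False
    task.native_ranges = []
```

In this model, a task whose only native range covers the thunk gets every other syscall redirected, a syscall from the thunk still has to pass the regular seccomp filters, and after `execve()` only those seccomp filters remain in effect.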