From: Jann Horn
Date: Fri, 3 Jul 2020 13:04:05 +0200
Subject: Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
To: "Catangiu, Adrian Costin"
Cc: "linux-mm@kvack.org", "linux-pm@vger.kernel.org",
    "virtualization@lists.linux-foundation.org",
    "linux-api@vger.kernel.org", "akpm@linux-foundation.org",
    "rjw@rjwysocki.net", "len.brown@intel.com", "pavel@ucw.cz",
    "mhocko@kernel.org", "fweimer@redhat.com", "keescook@chromium.org",
    "luto@amacapital.net", "wad@chromium.org", "mingo@kernel.org",
    "bonzini@gnu.org", "Graf (AWS), Alexander", "MacCarthaigh, Colm",
    "Singh, Balbir", "Sandu, Andrei", "Brooker, Marc", "Weiss, Radu",
    "Manwaring, Derek"

On Fri, Jul 3, 2020 at 12:34 PM Catangiu, Adrian Costin wrote:
> Cryptographic libraries carry pseudo random number generators to
> quickly provide randomness when needed. If such a random pool gets
> cloned, secrets may get revealed, as the same random number may get
> used multiple times. For fork, this was fixed using the WIPEONFORK
> madvise flag [1].
>
> Unfortunately, the same problem surfaces when a virtual machine gets
> cloned. The existing flag does not help there. This patch introduces a
> new flag to automatically clear memory contents on VM suspend/resume,
> which will allow random number generators to reseed when virtual
> machines get cloned.
>
> Examples of this are:
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after wake)
> - OpenSSL PRNG (reseed after wake)
>
> Benefits exist in two spaces:
> - The security benefits of a cloned virtual machine having a
>   re-initialized PRNG in every process are straightforward.
>   Without reinitialization, two or more cloned VMs could produce
>   identical random numbers, which are often used to generate secure
>   keys.
> - Provides a simple mechanism to avoid RAM exfiltration during
>   traditional sleep/hibernate on a laptop or desktop when memory,
>   and thus secrets, are vulnerable to offline tampering or inspection.

For the first usecase, I wonder which way around this would work
better - do the wiping when a VM is saved, or do it when the VM is
restored? I guess that at least in some scenarios, doing it on restore
would be nicer because that way the hypervisor can always instantly
save a VM without having to wait for the guest to say "alright, I'm
ready" - especially if someone e.g. wants to take a snapshot of a
running VM while keeping it running? Or do hypervisors inject such
ACPI transitions every time they snapshot/save/restore a VM anyway?

> This RFC is foremost aimed at defining a userspace interface to enable
> applications and libraries that store or cache sensitive information,
> to know that they need to regenerate it after process memory has been
> exposed to potential copying. The proposed userspace interface is
> a new MADV_WIPEONSUSPEND 'madvise()' flag used to mark pages which
> contain such data. This newly added flag would only be available on
> 64bit archs, since we've run out of 32bit VMA flags.
>
> The mechanism through which the kernel marks the application sensitive
> data as potentially copied, is a secondary objective of this RFC.
> In the current PoC proposal, the RFC kernel code combines
> MADV_WIPEONSUSPEND semantics with ACPI suspend/wake transitions to zero
> out all process pages that fall in VMAs marked as MADV_WIPEONSUSPEND
> and thus allow applications and libraries be notified and regenerate
> their sensitive data. Marking VMAs as MADV_WIPEONSUSPEND results in
> the VMAs being empty in the process after any suspend/wake cycle.
> Similar to MADV_WIPEONFORK, if the process accesses memory that was
> wiped on suspend, it will get zeroes. The address ranges are still
> valid, they are just empty.
>
> This patch adds logic to the kernel power code to zero out contents of
> all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> to any suspend state equal or greater/deeper than Suspend-to-memory,
> known as S3.
>
> MADV_WIPEONSUSPEND only works on private, anonymous mappings.
> The patch also adds MADV_KEEPONSUSPEND, to undo the effects of a
> prior MADV_WIPEONSUSPEND for a VMA.
>
> Hypervisors can issue ACPI S0->S3 and S3->S0 events to leverage this
> functionality in a virtualized environment.
>
> Alternative kernel implementation ideas:
> - Move the code that clears MADV_WIPEONFORK pages to a virtual
>   device driver that registers itself to ACPI events.
> - Add prerequisite that MADV_WIPEONFORK pages must be pinned (so
>   no faulting happens) and clear them in a custom/roll-your-own
>   device driver on a NMI handler. This could work in a virtualized
>   environment where the hypervisor pauses all other vCPUs before
>   injecting the NMI.
>
> [1] https://lore.kernel.org/lkml/20170811212829.29186-1-riel@redhat.com/
[...]
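
For reference, the MADV_WIPEONFORK behaviour that the description
compares against can be observed from userspace. The following is only
a minimal sketch, assuming Linux 4.14 or newer; child_sees_wiped_page()
is a hypothetical helper name, and the fallback #define uses the value
from the UAPI headers. It marks one private anonymous page, forks, and
has the child check that the range is still mapped but reads back as
zeroes, which is the signal a PRNG could use to reseed. It does not use
MADV_WIPEONSUSPEND, which only exists in the RFC patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* Linux UAPI value; fallback for old headers */
#endif

/*
 * Hypothetical helper (not part of the patch).
 * Returns 1 if the child saw the marked page zeroed, 0 if the child
 * still saw the parent's data, and -1 if the setup failed (for example
 * on a kernel that does not know MADV_WIPEONFORK).
 */
int child_sees_wiped_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;
	size_t len = (size_t)page;

	unsigned char *seed = mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (seed == MAP_FAILED)
		return -1;

	/* Stand-in for PRNG state: any non-zero pattern will do. */
	memset(seed, 0xA5, len);

	if (madvise(seed, len, MADV_WIPEONFORK) != 0) {
		munmap(seed, len);
		return -1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		munmap(seed, len);
		return -1;
	}

	if (pid == 0) {
		/* Child: the range is still valid, but should read as zero. */
		int wiped = 1;
		for (size_t i = 0; i < len; i++) {
			if (seed[i] != 0) {
				wiped = 0;
				break;
			}
		}
		_exit(wiped ? 0 : 1);
	}

	int status = 0;
	if (waitpid(pid, &status, 0) < 0) {
		munmap(seed, len);
		return -1;
	}

	/* The wipe only applies to the child; the parent keeps its data. */
	assert(seed[0] == 0xA5);
	munmap(seed, len);

	if (!WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Calling child_sees_wiped_page() from main() and checking for 1 is
enough to try it; -1 means the setup failed, e.g. on a kernel without
the flag.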
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index c874a7026e24..4282b7f0dd03 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -323,6 +323,78 @@ static bool platform_suspend_again(suspend_state_t state)
>  		suspend_ops->suspend_again() : false;
>  }
>
> +#ifdef VM_WIPEONSUSPEND
> +static void memory_cleanup_on_suspend(suspend_state_t state)
> +{
> +	struct task_struct *p;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct page *pages[32];
> +	unsigned long max_pages_per_loop = ARRAY_SIZE(pages);
> +
> +	/* Only care about states >= S3 */
> +	if (state < PM_SUSPEND_MEM)
> +		return;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		int gup_flags = FOLL_WRITE;
> +
> +		mm = p->mm;
> +		if (!mm)
> +			continue;
> +
> +		down_read(&mm->mmap_sem);

Blocking actions, such as locking semaphores, are forbidden in RCU
read-side critical sections. Also, from a more high-level perspective,
do we need to be careful here to avoid deadlocks with frozen tasks or
stuff like that?

> +		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +			unsigned long addr, nr_pages;
> +
> +			if (!(vma->vm_flags & VM_WIPEONSUSPEND))
> +				continue;
> +
> +			addr = vma->vm_start;
> +			nr_pages = (vma->vm_end - addr - 1) / PAGE_SIZE + 1;
> +			while (nr_pages) {
> +				int count = min(nr_pages, max_pages_per_loop);
> +				void *kaddr;
> +
> +				count = get_user_pages_remote(p, mm, addr,
> +							      count, gup_flags,
> +							      pages, NULL, NULL);

get_user_pages_remote() can wait for disk I/O (for swapping stuff back
in), which we'd probably like to avoid here. And I think it can also
wait for userfaultfd handling from userspace? zap_page_range() (which
is what e.g. MADV_DONTNEED uses) might be a better fit, since it can
yank entries out of the page table (forcing the next write fault to
allocate a new zeroed page) without faulting them into RAM.

> +				if (count <= 0) {
> +					/*
> +					 * FIXME: In this PoC just break if we
> +					 * get an error.
> +					 * In the final implementation we need
> +					 * to handle this better and not leave
> +					 * pages uncleared.
> +					 */
> +					break;
> +				}
> +				/* Go through pages buffer and clear them. */
> +				while (count) {
> +					struct page *page = pages[--count];
> +
> +					kaddr = kmap(page);
> +					clear_page(kaddr);
> +					kunmap(page);

(This part should go away, but if it stayed, you'd probably want to
use clear_user_highpage() or so instead of open-coding this.)

> +					put_page(page);
> +					nr_pages--;
> +					addr += PAGE_SIZE;
> +				}
> +			}
> +		}
> +		up_read(&mm->mmap_sem);
> +	}
> +	rcu_read_unlock();
> +}
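
The MADV_DONTNEED comparison can be checked from userspace as well.
What follows is a minimal sketch, assuming Linux semantics for private
anonymous mappings; dontneed_reads_back_zero() is a hypothetical helper
name. It dirties a few pages, drops them with madvise(MADV_DONTNEED),
and confirms that the range stays mapped and reads back as zeroes on
the next access. This only shows the userspace-visible effect; it does
not exercise zap_page_range() directly, and whether the suspend path
can call it safely is a separate question.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch).
 * Returns 1 if a private anonymous range reads back as zeroes after
 * MADV_DONTNEED, 0 if stale data was still visible, -1 on setup failure.
 */
int dontneed_reads_back_zero(void)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;
	size_t len = 4 * (size_t)page;

	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Fault the pages in and fill them with a non-zero pattern. */
	memset(buf, 0x5A, len);

	/* Drop the pages; the mapping itself stays in place. */
	if (madvise(buf, len, MADV_DONTNEED) != 0) {
		munmap(buf, len);
		return -1;
	}

	/* The next access faults in fresh zero-filled pages. */
	int zeroed = 1;
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != 0) {
			zeroed = 0;
			break;
		}
	}

	/* The range is still writable after the drop. */
	buf[0] = 1;
	assert(buf[0] == 1);

	munmap(buf, len);
	return zeroed;
}
```

A return value of 1 from dontneed_reads_back_zero() corresponds to the
"address ranges are still valid, they are just empty" behaviour that
the RFC describes for wiped VMAs.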