From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2737AC05027 for ; Mon, 6 Feb 2023 21:02:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B3A456B0075; Mon, 6 Feb 2023 16:02:50 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id AC3376B007D; Mon, 6 Feb 2023 16:02:50 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9171F6B007E; Mon, 6 Feb 2023 16:02:50 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 7C1746B0075 for ; Mon, 6 Feb 2023 16:02:50 -0500 (EST) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 4690A1209AC for ; Mon, 6 Feb 2023 21:02:50 +0000 (UTC) X-FDA: 80438091300.16.8C9E651 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf23.hostedemail.com (Postfix) with ESMTP id 2DE22140028 for ; Mon, 6 Feb 2023 21:02:46 +0000 (UTC) Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=hgrzkKRC; spf=pass (imf23.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1675717367; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=lT/t97DdkIMgb9qhfnXTuR1mlXFAu5r6VUP5tQl8fLs=; b=ur6BWyyA/nqgxEgdKJYXfeQ8Ikji14y0yDNqpwrLjJfmMzWHi0kqBT5qv5IkoNg3RqRO1B e6j5eP2OGWcRR3VAhI46N+pF95/IeLyRxIkPSdLD/l/MBDUtBw/ypp1ImEHsRsqTqlop+S DMA4G8xv/Tv2/RnN6Ae8lX5BVosn8vA= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=hgrzkKRC; spf=pass (imf23.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1675717367; a=rsa-sha256; cv=none; b=OYzaHBT1YUGrz68TsuJwQRG5YyG2sPE7zjSqGMhk53M06+6X9UKtvgjpu/8dCPjDZloGMc NVMCvfrtVS6gtUmYk8d+/vWdQ/zdSGzREtY8S4aiqesdjzeleK387rGf2RgtLGIxyJZY39 tbvMbT3jbshHPLbMNxn/BH9oLzPSLMs= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1675717366; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lT/t97DdkIMgb9qhfnXTuR1mlXFAu5r6VUP5tQl8fLs=; b=hgrzkKRCqjKlDBszec3hDDnVXoM2k1r6C4aMC0vrT/BdeWiTXgk2ohBYzKxA9XyNE3JrQg 7BAo93YN0rYILd4DtVzLJf23igy+RVM8pmc4SC48G/i2mLSoQUnnCHy+uditVUknCT+7dB fRAuuMblrYNLX8BmM4NxcBNrPGJ8FF8= Received: from mail-qv1-f71.google.com (mail-qv1-f71.google.com [209.85.219.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-400-BAL23hybN96tgsCkHzjdNQ-1; Mon, 06 Feb 2023 16:02:45 -0500 X-MC-Unique: BAL23hybN96tgsCkHzjdNQ-1 Received: by mail-qv1-f71.google.com with SMTP id c10-20020a05621401ea00b004c72d0e92bcso6466863qvu.12 for ; Mon, 06 Feb 2023 13:02:45 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=lT/t97DdkIMgb9qhfnXTuR1mlXFAu5r6VUP5tQl8fLs=; b=ssiuyLibCOqJzoqvhqEMAYWS21kihC8qOWzWYqqmB7MJ3nadjTJAnFJZgxprbZ32p7 eOLv6rFlZPB+7C/thYf/xCmLutzOpkHn2v7pq0rwL8Vko76WpthSdfoqGEVjFQKX2Of1 oTvSfjYHRlunCGX4eo6JEUUJ+GMgKYJBQt0cnpkCu4ZnzKV9i+KvjYU0c0a3mIanyDCC 9hDbOBfTr3dgsd0DaV7sd9xQlLQ/Ic2eFNPcpta2LJ93xh8NdE+mkp7zFHM1tyA3lg39 0JhNb0hXdUSXDXwi3FQLivUre+bBWWauK90TQrujPzDABdkr/GUm2ytZD07jZpzpGU+u eAJg== X-Gm-Message-State: AO0yUKWrgT4o0xB5C0bykKieg4K3BPH21JE/EqThZjIUiGOTTZk8Vkyn OV24MxdO2NkdVJToPUYrGQm0d3kF/1YOxPnXN7cSR/CD0aujsoRD2e9KJ1Sh+uT049N0xZ9+ndd g48NdwQGD70E= X-Received: by 2002:ac8:4e8d:0:b0:3b8:5f47:aac2 with SMTP id 13-20020ac84e8d000000b003b85f47aac2mr1652535qtp.1.1675717364406; Mon, 06 Feb 2023 13:02:44 -0800 (PST) X-Google-Smtp-Source: AK7set+XKPcRB961EMdtG7Ywry4tipCNZTGwJ0Vf/CYdZcb0JOglrmr1RrBGooj0ym/GazIEeN+1dA== X-Received: by 2002:ac8:4e8d:0:b0:3b8:5f47:aac2 with SMTP id 13-20020ac84e8d000000b003b85f47aac2mr1652499qtp.1.1675717364042; Mon, 06 Feb 2023 13:02:44 -0800 (PST) Received: from x1n (bras-base-aurron9127w-grc-56-70-30-145-63.dsl.bell.ca. [70.30.145.63]) by smtp.gmail.com with ESMTPSA id o11-20020ac87c4b000000b003b9a426d626sm8044886qtv.22.2023.02.06.13.02.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Feb 2023 13:02:43 -0800 (PST) Date: Mon, 6 Feb 2023 16:02:42 -0500 From: Peter Xu To: David Stevens Cc: linux-mm@kvack.org, Andrew Morton , "Kirill A . Shutemov" , Yang Shi , David Hildenbrand , Hugh Dickins , linux-kernel@vger.kernel.org Subject: Re: [PATCH v2] mm/khugepaged: skip shmem with userfaultfd Message-ID: References: <20230206112856.1802547-1-stevensd@google.com> MIME-Version: 1.0 In-Reply-To: <20230206112856.1802547-1-stevensd@google.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 2DE22140028 X-Stat-Signature: 4hr5w4xq9btntxxd9u3o9bj4tbhskqyn X-HE-Tag: 1675717366-190451 X-HE-Meta: U2FsdGVkX18l9FTkNlogZifrMU+26jSfNSpF1VU205VdAOlmTtfMIBilVwAZ76WOciiH9v9x0ddxPpQtliKN4O9vJhbbmKt4BrcFrDaJGxH6GHWiSb7U42RY9k/EJHa6Epx4ldhbE6plJnxELNiRa+938BcIdYYcHjaYhfPA492Qpqqkwu7Q8pIyz8ZEga0vYvvqhqBiK7h9HaLXl0/AU0yqIr3wVLR/O9DDSOhmraoHtZicpptUiL2MvvrqL1kutFCaRoWEYU3hEHr4vMW45MTBW2DGYuZGq6PckGeAoV+ybc8mo1A7QS0tTHV4KR6kWPKZ+wt3lzSDqFp2IX86pjnJHMw1/d1q3vfepbE2gnj7Shi7AYyDlg/RVpBK+kmKBCR5BzB8P511ykr/Jy0+24lubTlkkTPTikwjRrL2mXOQ5UMa2pGy8e3iT97q9Hu5qDHUhMoxfGk0iY0KBVG+njUm5XkupKwWpxXi5OAnpamvaBnyT7XWOfAqL2hBVAEjUU6TO7+13zMA/GDYqcpBHm7rnpUS0/GE4e48SkZnHMpKJ4abSLZe4Pkh7O5vEbmHUsmRaGRYgYq7V96LqG31fnWfGykm3EG/Tza8ObIMEcxG20sY2NToXga1DDNSBqH0Fh8szwlIdw4kGzQclZjT6ww+j/zGANMIXOjoVvSVFcTwPGSOz06er0In5Ro1e+KAc1J1g1Cr4yR7hJ+a38kCNn8UWSrc9CaKu2vJCW0J8gX8EiR+fZ5/iCln7jq7CEXHyOHgBdOtGoNvyUY8fY7H12ZUmqPV78rAe7M/R8WfabF0WCUMtxFXhJ6CMpq/8lAinyTiHO0sAK2nbfrgMEkMaxQIVQ+OHFFU+S/pv0HbbcaD1xmaCwiDtFBAHiSFv1yoBZhp+quDuGh/iZDiuVWD835yHP9PRQ7WzmMCNtex+N6LcaA0tl+DH3hxr3w2qtdQapNfxK/hycdACMgZoUc KtdxY11t mwnokmLJ6bMcbzV628v7mmSYfG0pCwDAAfOIF/i0PxcYfHIbEgHGOHUvDKO+LJBZsXAyQ2Hi637phLXPszhvRSkXYu/LBsWahSg2dY7v+WP0Eene14Op2tf5uT1mBFsKGIX85/OaOC1qFjsjRdFzCsu+JUGYSsk/pSvAJrTZr8z1pt2qaQqejPfVKU+dw+IpKlTCCotGE8KlyEbXK2DfkjPNCILdsUn6mgrqUaCeV3kNoqibidGj9rOPLJgL5hx48JokyztEaamnf1OOpnLZGMsV0aIiZO22sfqOqBh5ENbMHYFGohHxO5dfVmf/M4mRMaZQ6poH04mWsJyJP3o1JHahZa7LimVZEaJ3+8pvLcsGtMRouH1TxzrOK501uUx+68d1xwP7RER0YXuasXtkzc6yb82E5s4+JSU7WaKs+EgLtI6Ak5kYI/6GYkQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Feb 06, 2023 at 08:28:56PM +0900, David Stevens wrote: > From: David Stevens > > Collapsing memory will result in any empty pages in the target range > being filled by the new THP. If userspace has a userfaultfd registered > with MODE_MISSING, for any page which it knows to be missing after > registering the userfaultfd, it may expect a UFFD_EVENT_PAGEFAULT. > Taking these two facts together, khugepaged needs to take care when > collapsing pages in shmem to make sure it doesn't break the userfaultfd > API. > > This change first makes sure that the intermediate page cache state > during collapse is not visible by moving when gaps are filled to after > the page cache lock is acquired for the final time. This is necessary > because the synchronization provided by locking hpage is insufficient > for functions which operate on the page cache without actually locking > individual pages to examine their content (e.g. shmem_mfill_atomic_pte). > > This refactoring allows us to iterate over i_mmap to check for any VMAs > with userfaultfds and then finalize the collapse if no such VMAs exist, > all while holding the page cache lock. Since no mm locks are held, it is > necessary to add smb_rmb/smb_wmb to ensure that userfaultfd updates to > vm_flags are visible to khugepaged. However, no further locking of > userfaultfd state is necessary. Although new userfaultfds can be > registered concurrently with finalizing the collapse, any missing pages > that are being replaced can no longer be observed by userspace, so there > is no data race. > > This fix is targeted at khugepaged, but the change also applies to > MADV_COLLAPSE. The fact that the intermediate page cache state before > the rollback of a failed collapse can no longer be observed is > technically a userspace-visible change (via at least SEEK_DATA and > SEEK_END), but it is exceedingly unlikely that anything relies on being > able to observe that transient state. > > Signed-off-by: David Stevens > --- > fs/userfaultfd.c | 2 ++ > mm/khugepaged.c | 67 ++++++++++++++++++++++++++++++++++++++++-------- > 2 files changed, 59 insertions(+), 10 deletions(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index cc694846617a..6ddfcff11920 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, > const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP; > > vma->vm_flags = flags; > + /* Pairs with smp_rmb() in khugepaged's collapse_file() */ > + smp_wmb(); Could you help to explain why this is needed? What's the operation to serialize against updating vm_flags? > /* > * For shared mappings, we want to enable writenotify while > * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 79be13133322..97435c226b18 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -55,6 +55,7 @@ enum scan_result { > SCAN_CGROUP_CHARGE_FAIL, > SCAN_TRUNCATED, > SCAN_PAGE_HAS_PRIVATE, > + SCAN_PAGE_FILLED, > }; > > #define CREATE_TRACE_POINTS > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > * - allocate and lock a new huge page; > * - scan page cache replacing old pages with the new one > * + swap/gup in pages if necessary; > - * + fill in gaps; > * + keep old pages around in case rollback is required; > + * - finalize updates to the page cache; > * - if replacing succeeds: > * + copy data over; > * + free old pages; > @@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER); > int nr_none = 0, result = SCAN_SUCCEED; > bool is_shmem = shmem_file(file); > + bool i_mmap_locked = false; > int nr = 0; > > VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); > @@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > /* > * At this point the hpage is locked and not up-to-date. > - * It's safe to insert it into the page cache, because nobody would > - * be able to map it or use it in another way until we unlock it. > + * > + * While iterating, we may drop the page cache lock multiple times. It > + * is safe to replace pages in the page cache with hpage while doing so > + * because nobody is able to map or otherwise access the content of > + * hpage until we unlock it. However, we cannot insert hpage into empty > + * indicies until we know we won't have to drop the page cache lock > + * again, as doing so would let things which only check the presence > + * of pages in the page cache see a state that may yet be rolled back. > */ > > xas_set(&xas, start); > @@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > result = SCAN_TRUNCATED; > goto xa_locked; > } > - xas_set(&xas, index); > + xas_set(&xas, index + 1); I failed to figure out why this index needs a shift here... It seems to me it'll ignore the initial empty page cache, is that what the patch wanted? > } > if (!shmem_charge(mapping->host, 1)) { > result = SCAN_FAIL; > goto xa_locked; > } > - xas_store(&xas, hpage); [I raised a question in the other thread on whether it's legal to not populate page cache holes at all. We can keep the discussion there] > nr_none++; > continue; > } > @@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > put_page(page); > goto xa_unlocked; > } > + > + if (nr_none) { > + struct vm_area_struct *vma; > + int nr_none_check = 0; > + > + xas_unlock_irq(&xas); > + i_mmap_lock_read(mapping); > + i_mmap_locked = true; > + xas_lock_irq(&xas); > + > + xas_set(&xas, start); > + for (index = start; index < end; index++) { > + if (!xas_next(&xas)) > + nr_none_check++; > + } > + > + if (nr_none != nr_none_check) { > + result = SCAN_PAGE_FILLED; > + goto xa_locked; > + } > + > + /* > + * If userspace observed a missing page in a VMA with an armed > + * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for > + * that page, so we need to roll back to avoid suppressing such > + * an event. Any userfaultfds armed after this point will not be > + * able to observe any missing pages, since the page cache is > + * locked until after the collapse is completed. > + * > + * Pairs with smp_wmb() in userfaultfd_set_vm_flags(). > + */ > + smp_rmb(); > + vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) { > + if (userfaultfd_missing(vma)) { > + result = SCAN_EXCEED_NONE_PTE; > + goto xa_locked; > + } > + } > + } Thanks for writting the patch, but I am still confused why this can avoid race conditions of uffd missing mode. I assume UFFDIO_REGISTER is defined as: after UFFDIO_REGISTER ioctl succeeded on a specific shmem VMA, any page faults due to missing page cache on this vma should generate a page fault. If not, it's violating the userfaultfd semantics. Here, I don't see what stops a new vma from registering MISSING mode right away in parallel of collapsing. When registration completes, it means we should report uffd missing messages on the holes that collapse_file() scanned previously. However they'll be filled very possibly later with the thp which means the messages can be lost. Then the issue can still happen, while this patch only makes it very unlikely to happen, or am I wrong? Thanks, > + > nr = thp_nr_pages(hpage); > > if (is_shmem) > @@ -2000,6 +2047,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > xas_store(&xas, hpage); > xa_locked: > xas_unlock_irq(&xas); > + if (i_mmap_locked) > + i_mmap_unlock_read(mapping); > xa_unlocked: > > /* > @@ -2065,15 +2114,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > } > > xas_set(&xas, start); > - xas_for_each(&xas, page, end - 1) { > + end = index; > + for (index = start; index < end; index++) { > + xas_next(&xas); > page = list_first_entry_or_null(&pagelist, > struct page, lru); > if (!page || xas.xa_index < page->index) { > - if (!nr_none) > - break; > nr_none--; > - /* Put holes back where they were */ > - xas_store(&xas, NULL); > continue; > } > > -- > 2.39.1.519.gcb327c4b5f-goog > -- Peter Xu