From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 23 Mar 2023 12:07:46 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
To: David Stevens
Cc: linux-mm@kvack.org, Andrew Morton, Peter Xu, Matthew Wilcox,
	"Kirill A. Shutemov", Yang Shi, David Hildenbrand, Hugh Dickins,
	Jiaqi Yan, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
In-Reply-To: <20230307052036.1520708-4-stevensd@google.com>
Message-ID: <866d1a75-d462-563-dfd7-1aa2971a285b@google.com>
References: <20230307052036.1520708-1-stevensd@google.com>
 <20230307052036.1520708-4-stevensd@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Tue, 7 Mar 2023, David Stevens wrote:

> From: David Stevens
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved
> by simply not replacing present pages with hpage when iterating over
> the target range. The present pages are already locked, so replacing
> them with the locked hpage before the collapse is finalized is
> unnecessary.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace
> via data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens
> Acked-by: Peter Xu

NAK to this patch, I'm afraid: it deadlocks.

What I know it to deadlock against does not make the most persuasive
argument: cgroup v1's deprecated memcg moving, where mc_handle_file_pte()
uses filemap_get_incore_folio() while holding page table lock, and spins
around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu()
keeps failing, because collapse_file()'s old page has been left in the
xarray with its refcount frozen to 0.
Meanwhile, collapse_file() is spinning to get that page table lock, to
unmap pte of a later page.

mincore()'s use of filemap_get_incore_folio() would be liable to hit the
same deadlock. If we think for longer, we may find more examples. But
even when not actually deadlocking, it's wasting lots of CPU on
concurrent lookups (e.g. faults) spinning in filemap_get_entry().

I don't suppose it's entirely accurate, but think of keeping a folio
refcount frozen to 0 as like holding a spinlock (and this lock sadly out
of sight from lockdep).

The pre-existing code works because the old page with refcount frozen to
0 is immediately replaced in the xarray by an entry for the new hpage,
so the old page is no longer discoverable: and the new hpage is locked,
not with a spinlock but the usual folio/page lock, on which concurrent
lookups will sleep.

Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank
you - but I believe collapse_file() should be left as is, and the fix
made instead in mapping_seek_hole_data() or folio_seek_hole_data(): I
believe that should not jump to assume that a !uptodate folio is a hole
(as was reasonable to assume for shmem, before collapsing to huge got
added), but should lock the folio if !uptodate, and check again after
getting the lock - if still !uptodate, it's a shmem hole, not a
transient race with collapse_file().

I was (pleased but) a little surprised when Matthew found in 5.12 that
shmem_seek_hole_data() could be generalized to filemap_seek_hole_data():
he will have a surer grasp of what's safe or unsafe to assume of
!uptodate in non-shmem folios.

On an earlier audit, for different reasons, I did also run across
lib/buildid.c build_id_parse() using find_get_page() without checking
PageUptodate() - looks as if it might do the wrong thing if it races
with khugepaged collapsing text to huge, and should probably have a
similar fix.
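The lock-and-recheck fix suggested above can be sketched as a userspace
model; toy_folio, seek_sees_data and the pthread mutex are illustrative
stand-ins, not kernel API - in the kernel the lock would be the folio
lock that collapse_file() holds on the new hpage:

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Toy userspace model of the suggested folio_seek_hole_data() fix.
 * A "folio" may transiently be !uptodate while another thread (the
 * collapser) holds its lock.  The seeker must not report a hole from
 * that transient state: it takes the lock and re-checks.  All names
 * here are illustrative, not kernel API.
 */
struct toy_folio {
	pthread_mutex_t lock;	/* stands in for the folio lock */
	bool uptodate;		/* stands in for folio_test_uptodate() */
};

/* Return true if the folio counts as data, false if it is a hole. */
static bool seek_sees_data(struct toy_folio *f)
{
	bool up;

	if (f->uptodate)
		return true;	/* fast path: clearly data */
	pthread_mutex_lock(&f->lock);	/* waits out a concurrent collapse */
	up = f->uptodate;		/* re-check under the lock */
	pthread_mutex_unlock(&f->lock);
	return up;	/* still !uptodate: a real hole, not a race */
}
```

Under this model a reader that finds uptodate already set never takes
the lock, matching the existing fast path; only the transient !uptodate
case pays for sleeping on the lock, instead of wrongly reporting a hole.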
Hugh

> ---
>  mm/khugepaged.c | 50 ++++++++++++-------------------------------------
>  1 file changed, 12 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 51ae399f2035..bdde0a02811b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1930,12 +1930,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>  
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2104,13 +2098,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		/*
> -		 * Add the page to the list to be able to undo the collapse if
> -		 * something go wrong.
> +		 * Accumulate the pages that are being collapsed.
>  		 */
>  		list_add_tail(&page->lru, &pagelist);
> -
> -		/* Finally, replace with the new page. */
> -		xas_store(&xas, hpage);
>  		continue;
>  out_unlock:
>  		unlock_page(page);
> @@ -2149,8 +2139,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		goto rollback;
>  
>  	/*
> -	 * Replacing old pages with new one has succeeded, now we
> -	 * attempt to copy the contents.
> +	 * The old pages are locked, so they won't change anymore.
>  	 */
>  	index = start;
>  	list_for_each_entry(page, &pagelist, lru) {
> @@ -2230,11 +2219,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		/* nr_none is always 0 for non-shmem. */
>  		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
>  	}
> -	/* Join all the small entries into a single multi-index entry. */
> -	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> -	xas_store(&xas, hpage);
> -	xas_unlock_irq(&xas);
>  
> +	/*
> +	 * Mark hpage as uptodate before inserting it into the page cache so
> +	 * that it isn't mistaken for an fallocated but unwritten page.
> +	 */
>  	folio = page_folio(hpage);
>  	folio_mark_uptodate(folio);
>  	folio_ref_add(folio, HPAGE_PMD_NR - 1);
> @@ -2243,6 +2232,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		folio_mark_dirty(folio);
>  	folio_add_lru(folio);
>  
> +	/* Join all the small entries into a single multi-index entry. */
> +	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> +	xas_store(&xas, hpage);
> +	xas_unlock_irq(&xas);
> +
>  	/*
>  	 * Remove pte page tables, so we can re-fault the page as huge.
>  	 */
> @@ -2267,36 +2261,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  
>  rollback:
>  	/* Something went wrong: roll back page cache changes */
> -	xas_lock_irq(&xas);
>  	if (nr_none) {
>  		mapping->nrpages -= nr_none;
>  		shmem_uncharge(mapping->host, nr_none);
>  	}
>  
> -	xas_set(&xas, start);
> -	end = index;
> -	for (index = start; index < end; index++) {
> -		xas_next(&xas);
> -		page = list_first_entry_or_null(&pagelist,
> -				struct page, lru);
> -		if (!page || xas.xa_index < page->index) {
> -			nr_none--;
> -			continue;
> -		}
> -
> -		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> -
> +	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
>  		/* Unfreeze the page. */
>  		list_del(&page->lru);
>  		page_ref_unfreeze(page, 2);
> -		xas_store(&xas, page);
> -		xas_pause(&xas);
> -		xas_unlock_irq(&xas);
>  		unlock_page(page);
>  		putback_lru_page(page);
> -		xas_lock_irq(&xas);
>  	}
> -	VM_BUG_ON(nr_none);
>  	/*
>  	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
>  	 * This undo is not needed unless failure is due to SCAN_COPY_MC.
> @@ -2304,8 +2280,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	if (!is_shmem && result == SCAN_COPY_MC)
>  		filemap_nr_thps_dec(mapping);
>  
> -	xas_unlock_irq(&xas);
> -
>  	hpage->mapping = NULL;
>  
>  	unlock_page(hpage);
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog