From: David Stevens
Date: Tue, 28 Mar 2023 18:48:56 +0900
Subject: Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
To: Hugh Dickins
Cc: linux-mm@kvack.org, Andrew Morton, Peter Xu, Matthew Wilcox, "Kirill A . Shutemov", Yang Shi, David Hildenbrand, Jiaqi Yan, linux-kernel@vger.kernel.org
In-Reply-To: <866d1a75-d462-563-dfd7-1aa2971a285b@google.com>
References: <20230307052036.1520708-1-stevensd@google.com> <20230307052036.1520708-4-stevensd@google.com> <866d1a75-d462-563-dfd7-1aa2971a285b@google.com>

On Fri, Mar 24, 2023 at 4:08 AM Hugh Dickins wrote:
>
> On Tue, 7 Mar 2023, David Stevens wrote:
>
> > From: David Stevens
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> > The present pages are already locked, so replacing the
> > with the locked hpage before the collapse is finalized is unnecessary.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > an fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens
> > Acked-by: Peter Xu
>
> NAK to this patch, I'm afraid: it deadlocks.
>
> What I know it to deadlock against, does not make the most persuasive
> argument: cgroup v1 deprecated memcg moving, where mc_handle_file_pte()
> uses filemap_get_incore_folio() while holding page table lock, and spins
> around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu()
> keeps failing because collapse_file()'s old page has been left in the
> xarray with its refcount frozen to 0. Meanwhile, collapse_file() is
> spinning to get that page table lock, to unmap pte of a later page.
>
> mincore()'s use of filemap_get_incore_folio() would be liable to hit
> the same deadlock. If we think for longer, we may find more examples.
> But even when not actually deadlocking, it's wasting lots of CPU on
> concurrent lookups (e.g. faults) spinning in filemap_get_entry().

Ignoring my changes for now, these callers of filemap_get_incore_folio
seem broken to some degree with respect to khugepaged. Mincore can
show mlocked pages spuriously disappearing - this is pretty easy to
reproduce with concurrent calls to MADV_COLLAPSE and mincore. As for
the memcg code, I'm not sure how precise it is expected to be, but it
seems likely that khugepaged can make task migration accounting less
reliable (although I don't really understand the code).

> I don't suppose it's entirely accurate, but think of keeping a folio
> refcount frozen to 0 as like holding a spinlock (and this lock sadly out
> of sight from lockdep).
> The pre-existing code works because the old page
> with refcount frozen to 0 is immediately replaced in the xarray by an
> entry for the new hpage, so the old page is no longer discoverable:
> and the new hpage is locked, not with a spinlock but the usual
> folio/page lock, on which concurrent lookups will sleep.

Is it actually necessary to freeze the original pages? At least at a
surface level, it seems that the arguments in 87c460a0bded
("mm/khugepaged: collapse_shmem() without freezing new_page") would
apply to the original pages as well. And if it is actually necessary
to freeze the original pages, why is it not necessary to freeze the
hugepage for the rollback case? Rolling back hugepage->original pages
seems more-or-less symmetric to collapsing original pages->hugepage.

> Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank
> you - but I believe collapse_file() should be left as is, and the fix
> made instead in mapping_seek_hole_data() or folio_seek_hole_data():
> I believe that should not jump to assume that a !uptodate folio is a
> hole (as was reasonable to assume for shmem, before collapsing to huge
> got added), but should lock the folio if !uptodate, and check again
> after getting the lock - if still !uptodate, it's a shmem hole, not
> a transient race with collapse_file().

This sounds like it would work for lseek. I guess it could maybe be
made to sort of work for mincore if we abort the walk, lock the page,
restart the walk, and then re-validate the locked page. Although
that's not exactly efficient.

-David

> I was (pleased but) a little surprised when Matthew found in 5.12 that
> shmem_seek_hole_data() could be generalized to filemap_seek_hole_data():
> he will have a surer grasp of what's safe or unsafe to assume of
> !uptodate in non-shmem folios.
>
> On an earlier audit, for different reasons, I did also run across
> lib/buildid.c build_id_parse() using find_get_page() without checking
> PageUptodate() - looks as if it might do the wrong thing if it races
> with khugepaged collapsing text to huge, and should probably have a
> similar fix.
>
> Hugh
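
For readers who want to poke at the SEEK_DATA/SEEK_HOLE symptom from
userspace, here is a minimal sketch of the probe side only. It is not the
reproducer used for this patch. The helper names (make_data_file,
first_data_offset, first_hole_offset) and the mkstemp()-based setup are
assumptions made for illustration; only lseek() with SEEK_DATA/SEEK_HOLE
comes from the discussion above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes SEEK_DATA/SEEK_HOLE under this macro */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the feature macro above came too late. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: create a temporary file from a mkstemp() template
 * and fill it with `size` bytes of nonzero data.
 * Returns an open file descriptor, or -1 on failure.
 */
static int make_data_file(char *tmpl, size_t size)
{
	char buf[4096];
	size_t off = 0;
	int fd;

	assert(tmpl != NULL);
	fd = mkstemp(tmpl);
	if (fd < 0)
		return -1;
	unlink(tmpl); /* the open fd keeps the file alive */

	memset(buf, 0xab, sizeof(buf));
	while (off < size) {
		size_t n = size - off < sizeof(buf) ? size - off : sizeof(buf);

		if (pwrite(fd, buf, n, (off_t)off) != (ssize_t)n) {
			close(fd);
			return -1;
		}
		off += n;
	}
	return fd;
}

/* Offset of the first data at or after `from`; -1 (ENXIO) if none is seen. */
static off_t first_data_offset(int fd, off_t from)
{
	return lseek(fd, from, SEEK_DATA);
}

/* Offset of the first hole at or after `from` (EOF counts as a hole). */
static off_t first_hole_offset(int fd, off_t from)
{
	return lseek(fd, from, SEEK_HOLE);
}
```

With a template under a tmpfs mount such as /dev/shm, a fully written file
is expected to report data at offset 0 and its first hole at EOF. Under the
race described in this thread, a loop calling first_data_offset(fd, 0) could
transiently return a later offset or fail with ENXIO while another thread
collapses a mapping of the same file (for example via MADV_COLLAPSE). That
second thread is left out of the sketch.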