From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 292D2C5478C for ; Mon, 4 Mar 2024 13:03:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9B3BF6B0088; Mon, 4 Mar 2024 08:03:45 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 960146B008A; Mon, 4 Mar 2024 08:03:45 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 84F436B008C; Mon, 4 Mar 2024 08:03:45 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 744866B0088 for ; Mon, 4 Mar 2024 08:03:45 -0500 (EST) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 3CCCA40AD2 for ; Mon, 4 Mar 2024 13:03:45 +0000 (UTC) X-FDA: 81859373610.21.FDB75EC Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf29.hostedemail.com (Postfix) with ESMTP id 602C912000B for ; Mon, 4 Mar 2024 13:03:42 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf29.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709557422; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=n4XMU1WTWucvX9OenxGlAgrnYffhiZY+yhlKoD1Gb68=; b=aQTKoRmElQCMsXmJj8s5LWy9R+mJnCevLhIvytNcyJpT4pIyzuKzDVS8zK//JmrDzyrTzb YlMWj8GUK8fj3Ivopa1gAvqxJ4dZAeHXi+LgpIXsS04Vflx3IENNeQc0vW2LMzpdHbGIJp Rc1xfsY2PQhe5b+EuHEsPtm56Tj8FjY= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf29.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709557422; a=rsa-sha256; cv=none; b=OnszUnO2uVVhsXUf1jvdjGFsOdsjVgPcjS9vT0hcoqxRgECjQ4Ij8Eay4Vh2ralrVg6/4n ZzTfMYYrlnqnkNPLzRyi4D0qSpECUuu+YZ6T96bmL5GE+6f1QzNqbo7IkultAFfC2nSTlA tn0PQyP9v3Z5X3L/ACeF5rQ+I3h8LME= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0F0481FB; Mon, 4 Mar 2024 05:04:18 -0800 (PST) Received: from [10.57.68.92] (unknown [10.57.68.92]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 91A453F738; Mon, 4 Mar 2024 05:03:38 -0800 (PST) Message-ID: <10f9542e-f3d8-42b0-9de4-9867cab997b9@arm.com> Date: Mon, 4 Mar 2024 13:03:36 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio Content-Language: en-GB To: David Hildenbrand , Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org Cc: chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song , Hugh Dickins References: <20240304103757.235352-1-21cnbao@gmail.com> <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com> <37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com> From: Ryan Roberts In-Reply-To: <37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 602C912000B X-Stat-Signature: acp1ahpnu3ii3e1jktdwaspwc8wk7yhz X-HE-Tag: 1709557422-767357 X-HE-Meta: U2FsdGVkX18+90z6akWClET+ah43NX6maDFl3yUUZWGzxsRKS5ud6NW8EydOoCY9eevqnUI5VBn15y/1yAIXAQ64yyRVqcVLbv6TGeqA8mfrAmXCiaTy+mpBjjUlr21WxTT1SYsoqihfj5Cb8tJtH6GrI/f/PEjg+K7rwarjf970gDHYNvdtQo5D8aT/qHE+wpG0HwReJJDtaTBzewzFb4g6frzEDt/E5iD3+O3fCqf1a64ezJIj7L3fOa6hw60Mw8d1/69wU75w2NZx24MpJxCnMN1oRq8gUrQsmFmhOodpwpdF4gg++tztErSpE7ow1rK1tCi252IoWTTymHIV/+Td7689HG21YsFusExeq6RnmofD9CkvL7vXo9uadrpwsi7jYu8lbBHJlJzmOyCwCX1vu/XPst31wWY0E9+ZsaobBBJZHjSRDuDaaepAR36AB6ZEgzBRRWp9HVovuKq8KUq8AbAHmH8VRipBDAWp1S15axn371I1huOPPWxw+xxT29YQcDu466yEKc2w8NHirrDnee/gnWzZofReqlZ/dw1c0rlR+jS9ObsqBEJ+N4cbebUYMr4ucO5n0JF3eR5CIE8bA6/9LJrS1ZEtbg3KqDPzldO4IUQ1DcMaSv7QwRlwo03ggirVlKvC1syxt/fKqdZvVWg7MO+shgsqvXutMA1xVS8cIk1yvG7kRGFycXLS+0RQoyskcykewL+HBCM8zQPsYz+oJHVeY59Ure14g+qjs5tIZnCiv6U23ttZZF6caMx8pXSv1AinMdNLKgd7v1ylpvvYr1aJ7o29jc09bKQABvKAUmJS4NSTRaq92opylIzSaZzFTVdbXVacIe0/xszJt+Vqt6Sj1KhHLPJwyIX3i+IZV/15rnVi59zm0YWmthrRhe33RW880S4x7Y87wWj/pq0aui7k3ianXb1Y/y7283eRVsiJmsmYCdg9ut6755yt/zN+8iZc7fDv1bR gwmKnpRi w2BJ1xJL20+ChWuY6vVTNMk9A8OuqnFgO6+LsL7SqqAWAhOO79sFp3HqekbCO7BDzO5SLsveHiPfYcBgme33sk2EixhTm1psINUVHVrhY7dx5IghmI9zpFPCfpJqYqCgRy2tN5GMUEGKVwdFTaiz9aYH9hrezgYI2IKJCD33uNcwhipGr7Hr9Bm6OJ5ee7Mr5u79/fREdaPtgkuAW8ArGq6rtt1y+3TyjQMwwh4eyN4iKcYyco5iupLR+A8KK05Xqv3/0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 04/03/2024 12:41, David Hildenbrand wrote: > On 04.03.24 13:20, Ryan Roberts wrote: >> Hi Barry, >> >> On 04/03/2024 10:37, Barry Song wrote: >>> From: Barry Song >>> >>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>> PTEs modification such as break-before-make, while iterating PTEs >>> of a large folio, it will only begin to acquire PTL after it gets >>> a valid(present) PTE. break-before-make intermediately sets PTEs >>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>> in try_to_unmap_one(). >> >> I just want to check my understanding here - I think the problem occurs for >> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >> that I've had a look at the code and have a better understanding, I think that >> must be the case? And therefore this problem exists independently of my work to >> support swap-out of mTHP? (From your previous report I was under the impression >> that it only affected mTHP). >> >> Its just that the problem is becoming more pronounced because with mTHP, >> PTE-mapped large folios are much more common? > > That is my understanding. > >> >>> For example, for an anon folio, after try_to_unmap_one(), we may >>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >>> So folio will be still mapped, the folio fails to be reclaimed. >>> What’s even more worrying is, its PTEs are no longer in a unified >>> state. This might lead to accident folio_split() afterwards. And >>> since a part of PTEs are now swap entries, accessing them will >>> incur page fault - do_swap_page. >>> It creates both anxiety and more expense. While we can't avoid >>> userspace's unmap to break up unified PTEs such as CONT-PTE for >>> a large folio, we can indeed keep away from kernel's breaking up >>> them due to its code design. >>> This patch is holding PTL from PTE0, thus, the folio will either >>> be entirely reclaimed or entirely kept. On the other hand, this >>> approach doesn't increase PTL contention. Even w/o the patch, >>> page_vma_mapped_walk() will always get PTL after it sometimes >>> skips one or two PTEs because intermediate break-before-makes >>> are short, according to test. Of course, even w/o this patch, >>> the vast majority of try_to_unmap_one still can get PTL from >>> PTE0. This patch makes the number 100%. >>> The other option is that we can give up in try_to_unmap_one >>> once we find PTE0 is not the first entry we get PTL, we call >>> page_vma_mapped_walk_done() to end the iteration at this case. >>> This will keep the unified PTEs while the folio isn't reclaimed. >>> The result is quite similar with small folios with one PTE - >>> either entirely reclaimed or entirely kept. >>> Reclaiming large folios by holding PTL from PTE0 seems a better >>> option comparing to giving up after detecting PTL begins from >>> non-PTE0. >>> > > I'm sure that wall of text can be formatted in a better way :) . Also, I think > we can drop some of the details, > > If you need some inspiration, I can give it a shot. > >>> Cc: Hugh Dickins >>> Signed-off-by: Barry Song >> >> Do we need a Fixes tag? >> > > What would be the description of the problem we are fixing? > > 1) failing to unmap? > > That can happen with small folios as well IIUC. > > 2) Putting the large folio on the deferred split queue? > > That sounds more reasonable. Isn't the real problem today that we can end up writng a THP to the swap file (so 2M more IO and space used) but we can't remove it from memory, so no actual reclaim happens? Although I guess your (2) is really just another way of saying that. > >>> --- >>>   mm/vmscan.c | 11 +++++++++++ >>>   1 file changed, 11 insertions(+) >>> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index 0b888a2afa58..e4722fbbcd0c 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head >>> *folio_list, >>>                 if (folio_test_pmd_mappable(folio)) >>>                   flags |= TTU_SPLIT_HUGE_PMD; >>> +            /* >>> +             * if page table lock is not held from the first PTE of >>> +             * a large folio, some PTEs might be skipped because of >>> +             * races with break-before-make, for example, PTEs can >>> +             * be pte_none intermediately, thus one or more PTEs >>> +             * might be skipped in try_to_unmap_one, we might result >>> +             * in a large folio is partially mapped and partially >>> +             * unmapped after try_to_unmap >>> +             */ >>> +            if (folio_test_large(folio)) >>> +                flags |= TTU_SYNC; >> >> This looks sensible to me after thinking about it for a while. But I also have a >> gut feeling that there might be some more subtleties that are going over my >> head, since I'm not expert in this area. So will leave others to provide R-b :) >> > > As we are seeing more such problems with lockless PT walks, maybe we really want > some other special value (nonswap entry?) to indicate that a PTE this is > currently ondergoing protection changes. So we'd avoid the pte_none() > temporarily, if possible. > > Without that, TTU_SYNC feels like the right thing to do. >