From: Barry Song <21cnbao@gmail.com>
Date: Tue, 5 Mar 2024 10:04:34 +1300
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
To: David Hildenbrand
Cc: Ryan Roberts, akpm@linux-foundation.org, linux-mm@kvack.org,
	chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com,
	shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song,
	Hugh Dickins
In-Reply-To: <37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com>
References: <20240304103757.235352-1-21cnbao@gmail.com>
	<706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com>
	<37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com>
On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand wrote:
>
> On 04.03.24 13:20, Ryan Roberts wrote:
> > Hi Barry,
> >
> > On 04/03/2024 10:37, Barry Song wrote:
> >> From: Barry Song
> >>
> >> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >> PTE modifications such as
break-before-make: while iterating the PTEs
> >> of a large folio, it only begins to acquire the PTL after it finds
> >> a valid (present) PTE. break-before-make intermediately sets PTEs
> >> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >> in try_to_unmap_one().
> >
> > I just want to check my understanding here - I think the problem occurs for
> > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios?
> > Now that I've had a look at the code and have a better understanding, I
> > think that must be the case? And therefore this problem exists independently
> > of my work to support swap-out of mTHP? (From your previous report I was
> > under the impression that it only affected mTHP).
> >
> > It's just that the problem is becoming more pronounced because, with mTHP,
> > PTE-mapped large folios are much more common?
>
> That is my understanding.
>
> >
> >> For example, for an anon folio, after try_to_unmap_one(), we may
> >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >> So the folio will still be mapped, and the folio fails to be reclaimed.
> >> What's even more worrying is that its PTEs are no longer in a unified
> >> state. This might lead to an accidental folio_split() afterwards. And
> >> since some of the PTEs are now swap entries, accessing them will
> >> incur a page fault - do_swap_page.
> >> It creates both anxiety and more expense. While we can't avoid
> >> userspace's unmap breaking up unified PTEs such as CONT-PTE for
> >> a large folio, we can indeed keep the kernel from breaking them
> >> up by its own code design.
> >> This patch holds the PTL from PTE0, so the folio will either
> >> be entirely reclaimed or entirely kept. On the other hand, this
> >> approach doesn't increase PTL contention. Even w/o the patch,
> >> page_vma_mapped_walk() will always get the PTL after it sometimes
> >> skips one or two PTEs, because intermediate break-before-makes
> >> are short, according to testing.
Of course, even w/o this patch,
> >> the vast majority of try_to_unmap_one() calls can still get the PTL
> >> from PTE0. This patch makes the number 100%.
> >> The other option is to give up in try_to_unmap_one() once we find
> >> that PTE0 is not the first entry for which we get the PTL; in that
> >> case we call page_vma_mapped_walk_done() to end the iteration.
> >> This keeps the PTEs unified while the folio isn't reclaimed.
> >> The result is quite similar to small folios with one PTE -
> >> either entirely reclaimed or entirely kept.
> >> Reclaiming large folios by holding the PTL from PTE0 seems a better
> >> option than giving up after detecting that the PTL begins from a
> >> non-PTE0 entry.
> >>
>
> I'm sure that wall of text can be formatted in a better way :). Also, I
> think we can drop some of the details.
>
> If you need some inspiration, I can give it a shot.
>
> >> Cc: Hugh Dickins
> >> Signed-off-by: Barry Song
> >
> > Do we need a Fixes tag?

I am not quite sure which commit should be referenced in a Fixes tag.
I think it's more of an optimization.

> >
> > What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well, IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.

I don't feel it is reasonable. Keeping the kernel's own code from
accidentally splitting a large folio is the more reasonable approach,
as there is always a price to pay for splitting, unfolding PTEs, etc.
While we can't avoid splits coming from userspace's MADV_DONTNEED,
munmap, or mprotect, we do have a way to ensure the kernel itself
doesn't accidentally break up a large folio.

In OPPO's phones, we ran into some weird bugs due to skipped PTEs in
try_to_unmap_one(), and we could hardly fix them at the root cause;
with the various races involved, figuring out their timings was really
a big pain :-)

But we did "resolve" those bugs by leaving all PTEs entirely untouched
whenever we found some PTEs had been skipped in try_to_unmap_one() [1].
When we find that we only get the PTL from the 2nd or 3rd PTE but not
the 1st, we entirely give up on try_to_unmap_one() and leave all PTEs
untouched:

	/* we are not starting from head */
	if (!IS_ALIGNED((unsigned long)pvmw.pte,
			CONT_PTES * sizeof(*pvmw.pte))) {
		ret = false;
		atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
		set_pte_at(mm, address, pvmw.pte, pteval);
		page_vma_mapped_walk_done(&pvmw);
		break;
	}

This ensures all PTEs still have a unified state, such as CONT-PTE,
after try_to_unmap() fails.
I feel this could have some false positives, because when racing with
unmap, the 1st PTE might really become pte_none. So explicitly holding
the PTL from the 1st PTE seems a better way.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730

>
> >> ---
> >>   mm/vmscan.c | 11 +++++++++++
> >>   1 file changed, 11 insertions(+)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 0b888a2afa58..e4722fbbcd0c 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>
> >>                  if (folio_test_pmd_mappable(folio))
> >>                          flags |= TTU_SPLIT_HUGE_PMD;
> >> +                /*
> >> +                 * if the page table lock is not held from the first PTE of
> >> +                 * a large folio, some PTEs might be skipped because of
> >> +                 * races with break-before-make; for example, PTEs can
> >> +                 * be pte_none intermediately, thus one or more PTEs
> >> +                 * might be skipped in try_to_unmap_one, and we might end
> >> +                 * up with a large folio that is partially mapped and
> >> +                 * partially unmapped after try_to_unmap
> >> +                 */
> >> +                if (folio_test_large(folio))
> >> +                        flags |= TTU_SYNC;
> >
> > This looks sensible to me after thinking about it for a while. But I also
> > have a gut feeling that there might be some more subtleties that are going
> > over my head, since I'm not an expert in this area.
> > So will leave others to provide R-b :)
> >
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (a non-swap entry?) to indicate that
> a PTE is currently undergoing protection changes. So we'd avoid the
> pte_none() temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry