From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, david@redhat.com,
	ryan.roberts@arm.com, chrisl@kernel.org, yuzhao@google.com,
	hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org,
	xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com,
	wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
In-Reply-To: (Barry Song's message of "Tue, 5 Mar 2024 21:56:02 +1300")
References: <20240304103757.235352-1-21cnbao@gmail.com>
	<878r2x9ly3.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 05 Mar 2024 17:04:53 +0800
Message-ID: <87msrd82x6.fsf@yhuang6-desk2.ccr.corp.intel.com>

Barry Song <21cnbao@gmail.com> writes:

> On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Barry Song
>> >
>> > page_vma_mapped_walk() within try_to_unmap_one() races with other
>> > PTE modifications, such as break-before-make, while iterating the PTEs
>>
>> Sorry, I don't know what "break-before-make" is.  Can you elaborate?
>> IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear a PTE
>> temporarily, which may race with page_vma_mapped_walk().  Is that the
>> issue that you are trying to fix?
>
> we are writing the pte to zero (break) before writing a new value (make).

OK.  Is "break-before-make" commonly used terminology in the kernel?  If
not, it's better to explain it a little (e.g., ptep_get_and_clear() /
modify / set_pte_at()).

> while
> this behavior is done under PTL in another thread, page_vma_mapped_walk()
> in the try_to_unmap_one() thread won't take PTL till it meets a present PTE.

IIUC, !pte_none() should be enough?
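To make sure I understand the race, the sequence is roughly like this (a
minimal sketch using the generic helpers above; "ptl", "mm", "addr" and
"ptep" are assumed to be whatever the updater already has, and TLB
flushing is omitted):

	spin_lock(ptl);
	/* "break": the PTE is transiently pte_none() from here ... */
	oldpte = ptep_get_and_clear(mm, addr, ptep);
	/* modify: pte_mkold() is only an example modification */
	newpte = pte_mkold(oldpte);
	/* "make": ... until the new value becomes visible here */
	set_pte_at(mm, addr, ptep, newpte);
	spin_unlock(ptl);

A lockless page_vma_mapped_walk() that peeks between the "break" and the
"make" sees pte_none() and skips that entry.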
> for example, if another thread is modifying nr_pages PTEs under PTL,
> but we don't hold PTL, we might skip one or two PTEs at the beginning of
> a large folio.
> For a large folio, after try_to_unmap_one(), we may end up with PTE0 and
> PTE1 untouched but PTE2 ~ PTE(nr_pages - 1) set to swap entries.
>
> by holding PTL from PTE0 for large folios, we won't see these intermediate
> values.  At the moment we get PTL, the other threads are done.

Got it!  Thanks!
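And to restate the fix as I understand it: with TTU_SYNC set,
try_to_unmap_one() passes PVMW_SYNC to the walk, so page_vma_mapped_walk()
takes the PTL before inspecting the first PTE instead of scanning unlocked
until it finds a present one.  Roughly (simplified from my reading of
mm/rmap.c and mm/page_vma_mapped.c, not the literal code):

	/* try_to_unmap_one(): TTU_SYNC requests a synchronous walk */
	if (flags & TTU_SYNC)
		pvmw.flags |= PVMW_SYNC;

	/* mapping the PTE in page_vma_mapped_walk(), simplified: */
	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
	if (pvmw->flags & PVMW_SYNC) {
		/*
		 * lock first, then look: the walk waits for a concurrent
		 * break-before-make to finish instead of skipping the PTE
		 */
		pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
		spin_lock(pvmw->ptl);
	}
	/* without PVMW_SYNC, non-present PTEs are skipped without the lock */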
--
Best Regards,
Huang, Ying

>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > of a large folio, it will only begin to acquire PTL after it gets
>> > a valid (present) PTE.  break-before-make intermediately sets PTEs
>> > to pte_none.  Thus, a large folio's PTEs might be partially skipped
>> > in try_to_unmap_one().
>> > For example, for an anon folio, after try_to_unmap_one(), we may
>> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>> > So the folio will still be mapped, and the folio fails to be reclaimed.
>> > What's even more worrying is, its PTEs are no longer in a unified
>> > state.  This might lead to an accidental folio_split() afterwards.  And
>> > since a part of the PTEs are now swap entries, accessing them will
>> > incur a page fault - do_swap_page.
>> > It creates both anxiety and more expense.  While we can't avoid
>> > userspace's unmap breaking up unified PTEs such as CONT-PTE for
>> > a large folio, we can indeed keep the kernel from breaking them up
>> > due to its code design.
>> > This patch holds PTL from PTE0, thus the folio will either
>> > be entirely reclaimed or entirely kept.  On the other hand, this
>> > approach doesn't increase PTL contention.  Even w/o the patch,
>> > page_vma_mapped_walk() will always get PTL after it sometimes
>> > skips one or two PTEs, because intermediate break-before-makes
>> > are short, according to tests.  Of course, even w/o this patch,
>> > the vast majority of try_to_unmap_one() calls still get PTL from
>> > PTE0.  This patch makes the number 100%.
>> > The other option is to give up in try_to_unmap_one()
>> > once we find that PTE0 is not the first entry we get PTL for; we call
>> > page_vma_mapped_walk_done() to end the iteration in this case.
>> > This will keep the PTEs unified while the folio isn't reclaimed.
>> > The result is quite similar to small folios with one PTE -
>> > either entirely reclaimed or entirely kept.
>> > Reclaiming large folios by holding PTL from PTE0 seems a better
>> > option compared to giving up after detecting that PTL begins from
>> > non-PTE0.
>> >
>> > Cc: Hugh Dickins
>> > Signed-off-by: Barry Song
>> > ---
>> >  mm/vmscan.c | 11 +++++++++++
>> >  1 file changed, 11 insertions(+)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 0b888a2afa58..e4722fbbcd0c 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >
>> >             if (folio_test_pmd_mappable(folio))
>> >                     flags |= TTU_SPLIT_HUGE_PMD;
>> > +           /*
>> > +            * If the page table lock is not held from the first PTE of
>> > +            * a large folio, some PTEs might be skipped because of
>> > +            * races with break-before-make; for example, PTEs can
>> > +            * be pte_none intermediately, thus one or more PTEs
>> > +            * might be skipped in try_to_unmap_one, and we might end
>> > +            * up with a large folio that is partially mapped and
>> > +            * partially unmapped after try_to_unmap
>> > +            */
>> > +           if (folio_test_large(folio))
>> > +                   flags |= TTU_SYNC;
>> >
>> >             try_to_unmap(folio, flags);
>> >             if (folio_mapped(folio)) {
>
> Thanks
> Barry