From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, david@redhat.com,
 ryan.roberts@arm.com, chrisl@kernel.org, yuzhao@google.com,
 hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org,
 xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com,
 wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
In-Reply-To: <20240304103757.235352-1-21cnbao@gmail.com> (Barry Song's message of "Mon, 4 Mar 2024 23:37:57 +1300")
References: <20240304103757.235352-1-21cnbao@gmail.com>
Date: Tue, 05 Mar 2024 15:28:36 +0800
Message-ID: <878r2x9ly3.fsf@yhuang6-desk2.ccr.corp.intel.com>

Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song
>
> page_vma_mapped_walk() within try_to_unmap_one() races with other
> PTE modifications such as break-before-make: while iterating the PTEs

Sorry, I don't know what "break-before-make" is; can you elaborate?

IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear a PTE
temporarily, which may race with page_vma_mapped_walk().  Is that the
issue you are trying to fix?

--
Best Regards,
Huang, Ying

> of a large folio, it will only begin to acquire the PTL after it finds
> a valid (present) PTE.  Break-before-make transiently sets PTEs
> to pte_none.
> Thus, a large folio's PTEs might be partially skipped
> in try_to_unmap_one().
>
> For example, for an anon folio, after try_to_unmap_one(), we may
> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> So the folio will still be mapped and it fails to be reclaimed.
> What's even more worrying is that its PTEs are no longer in a unified
> state. This might lead to an accidental folio_split() afterwards. And
> since part of the PTEs are now swap entries, accessing them will
> incur a page fault - do_swap_page().
> That brings both uncertainty and extra cost.
>
> While we can't avoid userspace's unmap breaking up unified PTEs
> such as CONT-PTEs for a large folio, we can at least keep the kernel
> from breaking them up merely because of its own code design.
>
> This patch holds the PTL from PTE0, so the folio will either
> be entirely reclaimed or entirely kept. On the other hand, this
> approach doesn't increase PTL contention. Even without the patch,
> page_vma_mapped_walk() will always get the PTL after it sometimes
> skips one or two PTEs, because the intermediate break-before-make
> windows are short, according to tests. Of course, even without this
> patch, the vast majority of try_to_unmap_one() calls can still get
> the PTL from PTE0; this patch makes that number 100%.
>
> The other option is to give up in try_to_unmap_one() once we find
> that PTE0 is not the first entry for which we get the PTL, calling
> page_vma_mapped_walk_done() to end the iteration in that case.
> This keeps the PTEs unified while the folio isn't reclaimed.
> The result is quite similar to small folios with one PTE:
> either entirely reclaimed or entirely kept.
>
> Reclaiming large folios by holding the PTL from PTE0 seems a better
> option than giving up after detecting that the PTL was first taken
> at a non-PTE0 entry.
>
> Cc: Hugh Dickins
> Signed-off-by: Barry Song
> ---
>  mm/vmscan.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0b888a2afa58..e4722fbbcd0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  
> 			if (folio_test_pmd_mappable(folio))
> 				flags |= TTU_SPLIT_HUGE_PMD;
> +			/*
> +			 * If the page table lock is not held from the first PTE of
> +			 * a large folio, some PTEs might be skipped because of
> +			 * races with break-before-make; for example, PTEs can
> +			 * be pte_none transiently, so one or more PTEs might be
> +			 * skipped in try_to_unmap_one(), and we might end up
> +			 * with a large folio that is partially mapped and
> +			 * partially unmapped after try_to_unmap().
> +			 */
> +			if (folio_test_large(folio))
> +				flags |= TTU_SYNC;
>  
> 			try_to_unmap(folio, flags);
> 			if (folio_mapped(folio)) {
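
For context, below is a minimal sketch of the transient pte_none window
being discussed. It is modelled on the generic
ptep_modify_prot_start()/ptep_modify_prot_commit() helpers (the pattern
change_pte_range() in mm/mprotect.c follows); the function name is
hypothetical and the code is an illustration, not part of the patch or of
any existing kernel file.

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Illustration only: a protection change of a single PTE.  The generic
 * ptep_modify_prot_start() clears the entry and returns the old value;
 * ptep_modify_prot_commit() installs the new one.  The caller holds the
 * PTL (as change_pte_range() does), but a walker that inspects the PTE
 * before taking the PTL (which is what page_vma_mapped_walk() does
 * without PVMW_SYNC) can observe pte_none() between the two calls and
 * skip the entry instead of waiting for the lock.
 */
static void example_change_prot_one_pte(struct vm_area_struct *vma,
					unsigned long addr, pte_t *ptep,
					pgprot_t newprot)
{
	pte_t oldpte, newpte;

	/* Entry is cleared here: lockless readers now see pte_none(). */
	oldpte = ptep_modify_prot_start(vma, addr, ptep);

	newpte = pte_modify(oldpte, newprot);

	/* New value becomes visible again; the window closes. */
	ptep_modify_prot_commit(vma, addr, ptep, oldpte, newpte);
}

As I read mm/rmap.c and mm/page_vma_mapped.c, TTU_SYNC makes
try_to_unmap_one() set PVMW_SYNC on the walk, so page_vma_mapped_walk()
takes the PTL before deciding whether an entry is present and waits out
such windows instead of skipping PTEs; that is what lets the patch
guarantee the walk covers the large folio starting from PTE0.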