From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AF509C54E58
	for <linux-mm@archiver.kernel.org>; Wed, 20 Mar 2024 13:49:35 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id E6FF56B0082; Wed, 20 Mar 2024 09:49:34 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id E21456B0085; Wed, 20 Mar 2024 09:49:34 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id CE7D86B0088; Wed, 20 Mar 2024 09:49:34 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id BFAEF6B0082
	for <linux-mm@kvack.org>; Wed, 20 Mar 2024 09:49:34 -0400 (EDT)
Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id 320281213D1
	for <linux-mm@kvack.org>; Wed, 20 Mar 2024 13:49:34 +0000 (UTC)
X-FDA: 81917549868.24.D8506D4
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by imf06.hostedemail.com (Postfix) with ESMTP id 0903B180006
	for <linux-mm@kvack.org>; Wed, 20 Mar 2024 13:49:31 +0000 (UTC)
Authentication-Results: imf06.hostedemail.com;
	dkim=none;
	spf=pass (imf06.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1710942572;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=jytNNpl44FjP0/O9umeQmzLebuHQ7KwkayHqTkicTJw=;
	b=Z6c3NQo7ifE4rrpDe7eTyP+5JJqcPgh2kCselt9HuS/KYPLLrxrnwcZSkQJTG4CF8DXGGE
	pNgQQiNDsBnI4m6e+2YpLjlF5wIKBXpUxKMlor14yJ0y9KqP3VVyeMpFaRbpxS9zUVebsS
	4oRmDA1SivIELya9bUyvdpCpGnB/OZs=
ARC-Authentication-Results: i=1;
	imf06.hostedemail.com;
	dkim=none;
	spf=pass (imf06.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com;
	dmarc=pass (policy=none) header.from=arm.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1710942572; a=rsa-sha256;
	cv=none;
	b=0AJYLWX2zb4etUIaooAXGFPODQsKMOyJRVi7viG7QRhFsb2tJTXXjsreHIBQbmcKm+juP2
	x04XrZCrXO4XRbT+iFqky6lqavHNxTbYoeFqxkyeuPrXaySeX5WLfH4IRYIaz0qdSPk2SW
	KuqW74dmtfqaEL3e/RoeZB9IBpfkpzU=
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id E4B681007;
	Wed, 20 Mar 2024 06:50:03 -0700 (PDT)
Received: from [10.57.72.78] (unknown [10.57.72.78])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DECD83F64C;
	Wed, 20 Mar 2024 06:49:26 -0700 (PDT)
Message-ID: <7ba06704-2090-4eb2-9534-c4d467cc085a@arm.com>
Date: Wed, 20 Mar 2024 13:49:25 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and
 MADV_COLD
Content-Language: en-GB
To: Lance Yang <ioworker0@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>, Andrew Morton
 <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>,
 Matthew Wilcox <willy@infradead.org>, Huang Ying <ying.huang@intel.com>,
 Gao Xiang <xiang@kernel.org>, Yu Zhao <yuzhao@google.com>,
 Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
 Kefeng Wang <wangkefeng.wang@huawei.com>, Chris Li <chrisl@kernel.org>,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
 <20240311150058.1122862-7-ryan.roberts@arm.com>
 <CAGsJ_4wpjqRsn7ouO=Ut9oMBLSh803=XuSPX6gJ5nQ3jyqh3hQ@mail.gmail.com>
 <a75ec640-d025-45ee-b74d-305aaa3cc1ce@arm.com>
 <CAK1f24k1AuHDdrLFNLvwdoOy=xJTVkVdfY4+SN+KW5-EiMSa9Q@mail.gmail.com>
From: Ryan Roberts <ryan.roberts@arm.com>
In-Reply-To: <CAK1f24k1AuHDdrLFNLvwdoOy=xJTVkVdfY4+SN+KW5-EiMSa9Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Rspamd-Queue-Id: 0903B180006
X-Rspam-User: 
X-Stat-Signature: 1d6a444qaxmqbgyhi6p3ayueitoq8rwj
X-Rspamd-Server: rspam01
X-HE-Tag: 1710942571-684450
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Hi Lance, Barry,

Sorry - I totally missed this when you originally sent it!


On 13/03/2024 14:02, Lance Yang wrote:
> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the subsequent folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would likely have led to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - it's a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>         LIST_HEAD(folio_list);
>>>>         bool pageout_anon_only_filter;
>>>>         unsigned int batch_count = 0;
>>>> +       int nr;
>>>>
>>>>         if (fatal_signal_pending(current))
>>>>                 return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                 return 0;
>>>>         flush_tlb_batched_pending(mm);
>>>>         arch_enter_lazy_mmu_mode();
>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +               nr = 1;
>>>>                 ptent = ptep_get(pte);
>>>>
>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                         continue;
>>>>
>>>>                 /*
>>>> -                * Creating a THP page is expensive so split it only if we
>>>> -                * are sure it's worth. Split it if we are only owner.
>>>> +                * If we encounter a large folio, only split it if it is not
>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>> +                * fail to split a folio, leave it in place and advance to the
>>>> +                * next pte in the range.
>>>>                  */
>>>>                 if (folio_test_large(folio)) {
>>>> -                       int err;
>>>> -
>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>> -                               break;
>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                               break;
>>>> -                       if (!folio_trylock(folio))
>>>> -                               break;
>>>> -                       folio_get(folio);
>>>> -                       arch_leave_lazy_mmu_mode();
>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>> -                       start_pte = NULL;
>>>> -                       err = split_folio(folio);
>>>> -                       folio_unlock(folio);
>>>> -                       folio_put(folio);
>>>> -                       if (err)
>>>> -                               break;
>>>> -                       start_pte = pte =
>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                       if (!start_pte)
>>>> -                               break;
>>>> -                       arch_enter_lazy_mmu_mode();
>>>> -                       pte--;
>>>> -                       addr -= PAGE_SIZE;
>>>> -                       continue;
>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                            fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>>         if (pte_pfn(ptent) == folio_pfn(folio))
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                                      fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
> 
>                  if (folio_test_large(folio)) {
> [...]
>                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                             fpb_flags, NULL);
> +                       if (folio_estimated_sharers(folio) > 1)
> +                               continue;
> 
> Could we use folio_estimated_sharers as an early exit point here?

I'm not sure what this is saving where you have it? Did you mean to put it
before folio_pte_batch()? Currently it is just saving a single conditional.

But now that I think about it a bit more, I remember why I was originally
unconditionally calling folio_pte_batch(). Given it's a large folio, if the split
fails, we can move the cursor to the pte where the next folio begins. That way we
don't have to iterate one pte at a time, which would cause us to keep calling
folio_estimated_sharers(), folio_test_anon(), etc. on the same folio until we get
to the next boundary.

Of course the common case at this point will be for the split to succeed, but
then we are going to iterate over every single PTE anyway - one way or another
they are all fetched into cache. So I feel like it's neater not to add the
conditionals for calling folio_pte_batch(), and just leave this as I have it here.
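
To make the cursor argument above concrete, here is a minimal userspace sketch
(not kernel code, and not what the patch does verbatim). It assumes every split
attempt fails and counts how often the per-folio checks would run, with and
without skipping by the batch size. The names model_pte_batch() and
model_walk_checks() are hypothetical, and a plain int folio id per pte stands
in for the real folio lookup:

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for folio_pte_batch(): count the consecutive
 * entries, starting at idx, that map the same folio id.
 */
static size_t model_pte_batch(const int *pte_folio, size_t idx, size_t n)
{
	size_t nr = 1;

	while (idx + nr < n && pte_folio[idx + nr] == pte_folio[idx])
		nr++;
	return nr;
}

/*
 * Walk n ptes, assuming every split attempt fails, and return how many
 * times the per-folio checks run.
 */
static size_t model_walk_checks(const int *pte_folio, size_t n, int use_batch)
{
	size_t checks = 0;
	size_t i = 0;

	while (i < n) {
		size_t nr = use_batch ? model_pte_batch(pte_folio, i, n) : 1;

		checks++;	/* models folio_estimated_sharers() etc. */
		i += nr;	/* split failed: move the cursor forward */
	}
	return checks;
}
```

For two 4-page folios mapped back to back, the batched walk runs the checks
once per folio, while the pte-at-a-time walk runs them once per pte.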

> 
>                        if (nr < folio_nr_pages(folio)) {
>                                int err;
> 
> -                               if (folio_estimated_sharers(folio) > 1)
> -                                       continue;
> [...]
> 
>>
>>>
>>>> +
>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>> +                               int err;
>>>> +
>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>> +                                       continue;
>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                       continue;
>>>> +                               if (!folio_trylock(folio))
>>>> +                                       continue;
>>>> +                               folio_get(folio);
>>>> +                               arch_leave_lazy_mmu_mode();
>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>> +                               start_pte = NULL;
>>>> +                               err = split_folio(folio);
>>>> +                               folio_unlock(folio);
>>>> +                               folio_put(folio);
>>>> +                               if (err)
>>>> +                                       continue;
>>>> +                               start_pte = pte =
>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                               if (!start_pte)
>>>> +                                       break;
>>>> +                               arch_enter_lazy_mmu_mode();
>>>> +                               nr = 0;
>>>> +                               continue;
>>>> +                       }
>>>>                 }
>>>>
>>>>                 /*
>>>>                  * Do not interfere with other mappings of this folio and
>>>> -                * non-LRU folio.
>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>> +                * number of pages, it must be exclusive.
>>>>                  */
>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +               if (!folio_test_lru(folio) ||
>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                         continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                         continue;
>>>>
>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -               if (!pageout && pte_young(ptent)) {
>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                       tlb->fullmm);
>>>> -                       ptent = pte_mkold(ptent);
>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +               if (!pageout) {
>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> 
> IIRC, some architectures (e.g. PPC) don't update the TLB with set_pte_at and
> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> pte clearing?

Sorry Lance, I don't understand this question, can you rephrase? Are you saying
there is a good reason to do the original clear-mkold-set for some arches?

> 
> Thanks,
> Lance
> 
> 
> 
>>>> +                       }
>>>
>>> This looks so smart. If it is not pageout, we have already advanced pte
>>> and addr here, so nr is 0 and we don't need to advance again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> Otherwise, nr won't be 0, so we will advance addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>>                 }
>>>>
>>>>                 /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Thanks!
>>
>>
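
For anyone following the nr discussion quoted above, the loop shape can be
sketched in userspace roughly as below. This is only a model under stated
assumptions: model_walk() is a hypothetical name, batch[i] is an assumed
precomputed batch length for the entry at index i (standing in for the
folio_pte_batch() result), and touched[] stands in for clearing the young bit.

```c
#include <stddef.h>

/*
 * Model of the loop shape: the outer loop advances by nr, and the
 * non-pageout ("cold") path drains nr itself while visiting each entry.
 * batch[idx] must be >= 1 at every index where a batch starts.
 */
static size_t model_walk(const size_t *batch, size_t n, int pageout,
			 unsigned char *touched)
{
	size_t idx, nr, iterations = 0;

	for (idx = 0; idx < n; idx += nr) {
		nr = batch[idx];
		iterations++;

		if (!pageout) {
			/* Inner loop advances idx and leaves nr == 0. */
			for (; nr != 0; nr--, idx++)
				touched[idx] = 1;
		}
		/* pageout: nr is untouched, so the outer loop skips the batch. */
	}
	return iterations;
}
```

Either way the outer loop lands on the start of the next batch: in the cold
case because idx was already advanced and nr is 0, in the pageout case because
idx is advanced by nr in the loop header.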