From: Barry Song <21cnbao@gmail.com>
Date: Wed, 13 Mar 2024 23:37:19 +1300
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
To: Ryan Roberts
Cc: Lance Yang, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Chris Li, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com> <20240311150058.1122862-7-ryan.roberts@arm.com> <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts wrote:
>
> On 13/03/2024 09:16, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage.
> >>>> This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc.) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the subsequent folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would likely have led to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++---------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>         LIST_HEAD(folio_list);
> >>>>         bool pageout_anon_only_filter;
> >>>>         unsigned int batch_count = 0;
> >>>> +       int nr;
> >>>>
> >>>>         if (fatal_signal_pending(current))
> >>>>                 return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                 return 0;
> >>>>         flush_tlb_batched_pending(mm);
> >>>>         arch_enter_lazy_mmu_mode();
> >>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +               nr = 1;
> >>>>                 ptent = ptep_get(pte);
> >>>>
> >>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                         continue;
> >>>>
> >>>>                 /*
> >>>> -                * Creating a THP page is expensive so split it only if we
> >>>> -                * are sure it's worth. Split it if we are only owner.
> >>>> +                * If we encounter a large folio, only split it if it is not
> >>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>> +                * fail to split a folio, leave it in place and advance to the
> >>>> +                * next pte in the range.
> >>>>                  */
> >>>>                 if (folio_test_large(folio)) {
> >>>> -                       int err;
> >>>> -
> >>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>> -                               break;
> >>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                               break;
> >>>> -                       if (!folio_trylock(folio))
> >>>> -                               break;
> >>>> -                       folio_get(folio);
> >>>> -                       arch_leave_lazy_mmu_mode();
> >>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>> -                       start_pte = NULL;
> >>>> -                       err = split_folio(folio);
> >>>> -                       folio_unlock(folio);
> >>>> -                       folio_put(folio);
> >>>> -                       if (err)
> >>>> -                               break;
> >>>> -                       start_pte = pte =
> >>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                       if (!start_pte)
> >>>> -                               break;
> >>>> -                       arch_enter_lazy_mmu_mode();
> >>>> -                       pte--;
> >>>> -                       addr -= PAGE_SIZE;
> >>>> -                       continue;
> >>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                            fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >> if (pte_pfn(pte) == folio_pfn(folio)
> >
> > What about
> >
> > "if (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >
> > just to account for cases where the user's end address falls within
> > the middle of a large folio?
>
> Yes, even better. I'll add this for the next version.
>
> >
> > BTW, another minor issue is here:
> >
> >         if (++batch_count == SWAP_CLUSTER_MAX) {
> >                 batch_count = 0;
> >                 if (need_resched()) {
> >                         arch_leave_lazy_mmu_mode();
> >                         pte_unmap_unlock(start_pte, ptl);
> >                         cond_resched();
> >                         goto restart;
> >                 }
> >         }
> >
> > We increase batch_count by 1 for a batch of nr ptes, so we hold the PTL
> > longer than in the small-folio case, where we used to increase it by 1
> > for each pte. Does it matter?
>
> I thought about that, but the vast majority of the work is per-folio, not
> per-pte. So I concluded it would be best to continue to increment per-folio.

Okay. The original patch, commit b2f557a21bc8 ("mm/madvise: add
cond_resched() in madvise_cold_or_pageout_pte_range()"), primarily
addressed real-time wake-up latency. MADV_PAGEOUT and MADV_COLD are much
less critical than other scenarios, such as do_anon_page or do_swap_page,
which genuinely need the PTL to make progress. Therefore, an approach
that relatively aggressively releases the PTL seems to neither harm
MADV_PAGEOUT/COLD nor disadvantage others.

We are slightly increasing the time the PTL is held, because iterating in
folio_pte_batch() can take longer than the small-folio case, which does
not require it. However, compared to operations like folio_isolate_lru()
and folio_deactivate(), this increase seems negligible. Recently, we have
actually removed ptep_test_and_clear_young() for MADV_PAGEOUT, which
should also benefit real-time scenarios. Nonetheless, there is a small
risk with large folios, such as 1 MiB mTHP, where we may need to loop 256
times in folio_pte_batch().

I would vote for increasing batch_count by nr, or maybe by
max(log2(nr), 1), rather than by 1, for two reasons:

1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are improving
   them by reducing the time taken to put the same number of pages onto
   the reclaim list.
2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
   genuinely require the PTL to progress. Moreover, the majority of the
   time spent on PAGEOUT is actually in reclaim_pages().

> >
> >> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                      fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>
> >>>> +
> >>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>> +                               int err;
> >>>> +
> >>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>> +                                       continue;
> >>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                       continue;
> >>>> +                               if (!folio_trylock(folio))
> >>>> +                                       continue;
> >>>> +                               folio_get(folio);
> >>>> +                               arch_leave_lazy_mmu_mode();
> >>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>> +                               start_pte = NULL;
> >>>> +                               err = split_folio(folio);
> >>>> +                               folio_unlock(folio);
> >>>> +                               folio_put(folio);
> >>>> +                               if (err)
> >>>> +                                       continue;
> >>>> +                               start_pte = pte =
> >>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                               if (!start_pte)
> >>>> +                                       break;
> >>>> +                               arch_enter_lazy_mmu_mode();
> >>>> +                               nr = 0;
> >>>> +                               continue;
> >>>> +                       }
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>>                  * Do not interfere with other mappings of this folio and
> >>>> -                * non-LRU folio.
> >>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>> +                * number of pages, it must be exclusive.
> >>>>                  */
> >>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +               if (!folio_test_lru(folio) ||
> >>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                         continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                         continue;
> >>>>
> >>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -               if (!pageout && pte_young(ptent)) {
> >>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                       tlb->fullmm);
> >>>> -                       ptent = pte_mkold(ptent);
> >>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +               if (!pageout) {
> >>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +                       }
> >>>
> >>> This looks so smart. If it is not pageout, we have already advanced pte
> >>> and addr here, so nr is 0 and we don't need to advance again in
> >>>
> >>>     for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> Otherwise, nr won't be 0, so we will advance addr and pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>>
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song
> >>
> >> Thanks!

Thanks
Barry