Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page


From: yangge1116 <yangge1116@126.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com,
	liuzixing@hygon.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
Date: Tue, 18 Jun 2024 13:49:57 +0800
Message-ID: <5177ce6d-6c75-482e-4fa8-749972f08ffe@126.com>
In-Reply-To: <CAGsJ_4ytYTpvRVgR1EoazsH=QxZCDE2e8H0BeXrY-6zWFD0kCg@mail.gmail.com>



On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>
>>>>>
>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>
>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>> of pages in the THP-sized PCP list, it's possible to get a CMA page
>>>>>>> from the list. In some cases this is not acceptable; for example,
>>>>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag set may
>>>>>>> return a CMA page.
>>>>>>>
>>>>>>> The patch forbids allocating a non-CMA THP-sized page from the
>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>
>>>>>> Could you please describe the impact on users in the commit log?
>>>>>
>>>>> If a large amount of CMA memory is configured in the system (for
>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>> virtual machine with device passthrough will get stuck.
>>>>>
>>>>> While starting the virtual machine, pin_user_pages_remote(...,
>>>>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in the CMA
>>>>> area, pin_user_pages_remote() will migrate it from the CMA area to a
>>>>> non-CMA area because of the FOLL_LONGTERM flag. If non-movable
>>>>> allocation requests return CMA memory, pin_user_pages_remote() will
>>>>> loop endlessly.
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>
>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>> allocation requests?
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>> to fail?
>>>>>
>>>>>
>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>> non-movable allocation requests return CMA memory,
>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>> CMA page, which is useless and causes endless loops in
>>>>> __gup_longterm_locked().
>>>
>>> This is only one perspective. We also need to consider the impact on
>>> CMA itself. For example, when CMA is borrowed by THP, and we need to
>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>> those pages out to ensure CMA's users can retrieve that contiguous
>>> memory.
>>>
>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>> can't relocate them. As a result, cma_alloc() is more likely to fail.
>>>
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>> ---
>>>>>>>     mm/page_alloc.c | 10 ++++++++++
>>>>>>>     1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>
>>>>>>>            if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>> +                                               order != HPAGE_PMD_ORDER) {
>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>> +                                               migratetype, alloc_flags);
>>>>>>> +                       if (likely(page))
>>>>>>> +                               goto out;
>>>>>>> +               }
>>>>>>
>>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>>>> But it still seems better than causing the failure of CMA allocation.
>>>>>>
>>>>>> Is there a possible approach to avoid adding CMA THP into pcp from
>>>>>> the very beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>
>>>>
>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>
>>>
>>> But the majority of movable pages aren't CMA, right?
>>
>>> Do we have an estimate for
>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
>>> the original intention for THP was to avoid by having only one PCP[1]?
>>>
>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>
>>
>> The size of struct per_cpu_pages is 256 bytes in current code containing
>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[13];
>> }
>> SIZE: 256
>>
>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[15];
>> }
>> SIZE: 272
>>
>> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") saves one cacheline.
> 
> The proposal is not to revert the patch but to add one CMA pcp,
> so it is "struct list_head lists[14]"; in this case, the size is still
> 256?
> 

Yes, if we add only one CMA pcp list, the size is still 256 bytes. Adding
one CMA pcp list seems more reasonable.
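
As a rough sanity check of the 256-byte figure, here is a userspace sketch
rather than kernel code. It mirrors the field list printed by crash above
under assumptions that are not confirmed in this thread: an LP64 target
with 8-byte pointers, a 4-byte lock (no lock debugging), and a structure
aligned to 64-byte cachelines. All mock_* and MOCK_* names are made up for
illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines, as on typical x86_64 SMP builds. */
#define MOCK_CACHELINE 64

/* Userspace stand-in for the kernel's struct list_head (two pointers). */
struct mock_list_head {
	struct mock_list_head *next, *prev;
};

/*
 * Mock of the field layout shown by crash. The lock is modelled as a
 * 4-byte int (assumption), and the first member is cacheline-aligned so
 * that sizeof() is rounded up to a whole number of cachelines.
 */
#define DEFINE_MOCK_PCP(name, nr_lists)				\
	struct name {						\
		alignas(MOCK_CACHELINE) int lock;		\
		int count;					\
		int high;					\
		int high_min;					\
		int high_max;					\
		int batch;					\
		uint8_t flags;					\
		uint8_t alloc_factor;				\
		uint8_t expire;					\
		short free_count;				\
		struct mock_list_head lists[nr_lists];		\
	}

DEFINE_MOCK_PCP(mock_pcp_13, 13);	/* current layout: lists[13] */
DEFINE_MOCK_PCP(mock_pcp_14, 14);	/* one extra list for CMA THP */

/* The alignment above guarantees a whole number of cachelines. */
static_assert(sizeof(struct mock_pcp_13) % MOCK_CACHELINE == 0,
	      "mock pcp must span whole cachelines");

/* Number of cachelines covered by a structure of the given size. */
static inline size_t mock_cachelines(size_t bytes)
{
	return (bytes + MOCK_CACHELINE - 1) / MOCK_CACHELINE;
}
```

Under those assumptions the lists array starts at offset 32, so 13 lists
end at byte 240 (padded up to 256) and 14 lists end exactly at byte 256.
Both variants cover four cachelines in this model; the real structure may
differ depending on kernel configuration.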

> 
>>
>>>
>>>> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
>>>> followed it.
>>>>
>>>>
>>>>>>> +#else
>>>>>>>                    page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>                                           migratetype, alloc_flags);
>>>>>>>                    if (likely(page))
>>>>>>>                            goto out;
>>>>>>> +#endif
>>>>>>>            }
>>>>>>>
>>>>>>>            page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>> Thanks
>>>>>> Barry
>>>>>>
>>>>
>>>>
>>
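
For readers following the quoted diff, the condition it adds in front of
rmqueue_pcplist() can be modelled in userspace roughly as below. This is
only a sketch of the predicate, not the kernel function: the flag value
and the THP order are assumptions chosen for illustration (a 2 MiB THP
with 4 KiB base pages gives order 9), CONFIG_CMA is passed in as a plain
boolean, and the outer pcp_allowed_order() check is left out.

```c
#include <stdbool.h>

/* Assumed values for illustration; the real ones come from kernel headers. */
#define MOCK_ALLOC_CMA		0x80u	/* stand-in for ALLOC_CMA */
#define MOCK_HPAGE_PMD_ORDER	9u	/* stand-in for HPAGE_PMD_ORDER */

/*
 * Mirrors the condition in the quoted patch: the PCP list may be used
 * unless CMA is enabled, the request is not allowed to take CMA pages,
 * and the request is THP-sized. In that one case the allocation skips
 * the PCP list, because the single THP-sized list may hold CMA pages.
 */
static bool mock_may_use_pcplist(bool cma_enabled, unsigned int alloc_flags,
				 unsigned int order)
{
	return !cma_enabled ||
	       (alloc_flags & MOCK_ALLOC_CMA) ||
	       order != MOCK_HPAGE_PMD_ORDER;
}
```

With these stand-in values, mock_may_use_pcplist(true, 0, 9) is the only
combination that returns false, which is the "non-CMA THP gets no chance
to use PCP" case discussed above.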

