From: Ge Yang <yangge1116@126.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com,
liuzixing@hygon.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
Date: Wed, 19 Jun 2024 16:35:41 +0800
Message-ID: <9a636c96-7279-4a84-9dbc-5ab4f262ab96@126.com>
In-Reply-To: <CAGsJ_4xh3Bsd8RZ9v8Am=TmFWPfo_T4UVgptq4gVH9=QOHnDvw@mail.gmail.com>
On 2024/6/19 16:20, Barry Song wrote:
> On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 15:51, yangge1116 wrote:
>>>
>>>
>>> On 2024/6/18 2:58 PM, Barry Song wrote:
>>>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>>>>>>>>>> for THP-sized allocations") no longer differentiates the
>>>>>>>>>>>> migration type of pages on the THP-sized PCP list, it's possible
>>>>>>>>>>>> to get a CMA page from the list. In some cases this is not
>>>>>>>>>>>> acceptable; for example, an allocation of a non-CMA page with the
>>>>>>>>>>>> PF_MEMALLOC_PIN flag set may return a CMA page.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch forbids allocating non-CMA THP-sized pages from the
>>>>>>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>>>>>>
>>>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>>>
>>>>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>>>>> example, CMA memory accounts for 50% of the system memory),
>>>>>>>>>> starting a virtual machine with device passthrough gets stuck.
>>>>>>>>>>
>>>>>>>>>> While starting the virtual machine, pin_user_pages_remote(...,
>>>>>>>>>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
>>>>>>>>>> area, pin_user_pages_remote() migrates it from the CMA area to a
>>>>>>>>>> non-CMA area because of the FOLL_LONGTERM flag. If the non-movable
>>>>>>>>>> allocation request returns CMA memory, pin_user_pages_remote()
>>>>>>>>>> enters an endless loop.
>>>>>>>>>>
>>>>>>>>>> backtrace:
>>>>>>>>>> pin_user_pages_remote
>>>>>>>>>> ----__gup_longterm_locked                 // endless loop in this function
>>>>>>>>>> --------__get_user_pages_locked
>>>>>>>>>> --------check_and_migrate_movable_pages   // check always fails; migration is retried
>>>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>>> ----------------alloc_migration_target    // non-movable allocation
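>>>>>>>>>>
>>>>>>>>>> To illustrate, a toy model of the loop (a sketch only, not the
>>>>>>>>>> actual mm/gup.c code; in the kernel the check is roughly
>>>>>>>>>> is_migrate_cma_page() via folio_is_longterm_pinnable()):
>>>>>>>>>>
>>>>>>>>>> 	/* Pinning retries while any page is unpinnable, e.g. sits
>>>>>>>>>> 	 * in a CMA pageblock. If the migration target itself comes
>>>>>>>>>> 	 * from CMA -- possible via the shared THP-sized PCP list --
>>>>>>>>>> 	 * the check can never pass and the loop never terminates: */
>>>>>>>>>> 	while (is_migrate_cma_page(page))       /* always true */
>>>>>>>>>> 		page = alloc_migration_target(page, private);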
>>>>>>>>>>
>>>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>>>> allocation requests?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> If so, will CMA somehow become unable to migrate, causing
>>>>>>>>>>> cma_alloc() to fail?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No, it causes endless loops in __gup_longterm_locked(). If a
>>>>>>>>>> non-movable allocation request returns CMA memory,
>>>>>>>>>> migrate_longterm_unpinnable_pages() migrates a CMA page to another
>>>>>>>>>> CMA page, which is useless and causes __gup_longterm_locked() to
>>>>>>>>>> loop forever.
>>>>>>>>
>>>>>>>> This is only one perspective. We also need to consider the impact
>>>>>>>> on CMA itself. For example, when CMA is borrowed by THP and we need
>>>>>>>> to reclaim it through cma_alloc() or dma_alloc_coherent(), we must
>>>>>>>> move those pages out so that CMA's users can retrieve that
>>>>>>>> contiguous memory.
>>>>>>>>
>>>>>>>> If CMA memory is occupied by non-movable pages, we can't relocate
>>>>>>>> them. As a result, cma_alloc() is more likely to fail.
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>>>> THP-sized allocations")
>>>>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>>>> 1 file changed, 10 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>>>>  	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>>>>
>>>>>>>>>>>>  	if (likely(pcp_allowed_order(order))) {
>>>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>>>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>>>>> +				order != HPAGE_PMD_ORDER) {
>>>>>>>>>>>> +			page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>>> +					migratetype, alloc_flags);
>>>>>>>>>>>> +			if (likely(page))
>>>>>>>>>>>> +				goto out;
>>>>>>>>>>>> +		}
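>>>>>>>>>>>>
>>>>>>>>>>>> For clarity, the guard above bypasses the PCP fast path only when
>>>>>>>>>>>> all of the following hold (the equivalent boolean, by De Morgan):
>>>>>>>>>>>>
>>>>>>>>>>>> 	bool skip_pcp = IS_ENABLED(CONFIG_CMA) &&
>>>>>>>>>>>> 			!(alloc_flags & ALLOC_CMA) &&
>>>>>>>>>>>> 			order == HPAGE_PMD_ORDER;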
>>>>>>>>>>>
>>>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
>>>>>>>>>>> the PCP. But it still seems better than causing CMA allocations
>>>>>>>>>>> to fail.
>>>>>>>>>>>
>>>>>>>>>>> Is there a possible approach to avoid adding CMA THP to the PCP
>>>>>>>>>>> from the very beginning? Otherwise, we might need a separate PCP
>>>>>>>>>>> list for CMA.
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so
>>>>>>>>> avoiding adding CMA THP to the PCP may incur a slight performance
>>>>>>>>> penalty.
>>>>>>>>>
>>>>>>>>
>>>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>>>>
>>>>>>>> Do we have an estimate of the cost of adding back a CMA THP PCP
>>>>>>>> list? Will per_cpu_pages grow by a new cacheline, which the original
>>>>>>>> THP change intended to avoid by having only one PCP list [1]?
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>>>
>>>>>>>
>>>>>>> With commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") applied, the size of struct per_cpu_pages in
>>>>>>> the current code is 256 bytes:
>>>>>>> crash> struct per_cpu_pages
>>>>>>> struct per_cpu_pages {
>>>>>>>     spinlock_t lock;
>>>>>>>     int count;
>>>>>>>     int high;
>>>>>>>     int high_min;
>>>>>>>     int high_max;
>>>>>>>     int batch;
>>>>>>>     u8 flags;
>>>>>>>     u8 alloc_factor;
>>>>>>>     u8 expire;
>>>>>>>     short free_count;
>>>>>>>     struct list_head lists[13];
>>>>>>> }
>>>>>>> SIZE: 256
>>>>>>>
>>>>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one
>>>>>>> PCP list for THP-sized allocations"), the size of struct
>>>>>>> per_cpu_pages is 272 bytes:
>>>>>>> crash> struct per_cpu_pages
>>>>>>> struct per_cpu_pages {
>>>>>>>     spinlock_t lock;
>>>>>>>     int count;
>>>>>>>     int high;
>>>>>>>     int high_min;
>>>>>>>     int high_max;
>>>>>>>     int batch;
>>>>>>>     u8 flags;
>>>>>>>     u8 alloc_factor;
>>>>>>>     u8 expire;
>>>>>>>     short free_count;
>>>>>>>     struct list_head lists[15];
>>>>>>> }
>>>>>>> SIZE: 272
>>>>>>>
>>>>>>> So commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") saves one cacheline.
>>>>>>
>>>>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>>>>> so it becomes "struct list_head lists[14]"; in this case, is the size
>>>>>> still 256?
>>>>>>
>>>>>
>>>>> Yes, the size is still 256. If we add one PCP list, we will have two
>>>>> PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used
>>>>> by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
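>>>>>
>>>>> A rough sizing check (a sketch assuming a 64-bit kernel, a 16-byte
>>>>> struct list_head, ~32 bytes of scalar fields and 64-byte cacheline
>>>>> alignment; totals match the crash dumps above):
>>>>>
>>>>> 	lists[13]: 32 + 13 * 16 = 240 -> padded to 256, 4 cachelines
>>>>> 	lists[14]: 32 + 14 * 16 = 256 exactly,          4 cachelines
>>>>> 	lists[15]: 32 + 15 * 16 = 272,                  5 cachelines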
>>>>
>>>> I am not quite sure about MIGRATE_RECLAIMABLE; what we want is for
>>>> CMA to be used only by movable allocations. So it might be:
>>>> MOVABLE and NON-MOVABLE.
>>>
>>> One PCP list used by UNMOVABLE pages and the other used by MOVABLE
>>> pages seems feasible: the UNMOVABLE PCP list would hold
>>> MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP
>>> list would hold MIGRATE_MOVABLE pages.
>>>
>>
>> Is the following modification feasible?
>>
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define NR_PCP_THP 1
>> +#define NR_PCP_THP 2
>> +#define PCP_THP_MOVABLE 0
>> +#define PCP_THP_UNMOVABLE 1
>> #else
>> #define NR_PCP_THP 0
>> #endif
>>
>> static inline unsigned int order_to_pindex(int migratetype, int order)
>> {
>> +	int pcp_type = migratetype;
>> +
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
>> 		VM_BUG_ON(order != HPAGE_PMD_ORDER);
>> -		return NR_LOWORDER_PCP_LISTS;
>> +
>> +		if (migratetype != MIGRATE_MOVABLE)
>> +			pcp_type = PCP_THP_UNMOVABLE;
>> +		else
>> +			pcp_type = PCP_THP_MOVABLE;
>> +
>> +		return NR_LOWORDER_PCP_LISTS + pcp_type;
>> 	}
>> #else
>> 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>> #endif
>>
>> -	return (MIGRATE_PCPTYPES * order) + migratetype;
>> +	return (MIGRATE_PCPTYPES * order) + pcp_type;
>> }
>>
>
> A minimal change might be the following; then you can drop most of the
> new code.
>
Thanks, I will prepare the V2.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 120a317d0938..cfe1e0625e38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)
>
>  static inline unsigned int order_to_pindex(int migratetype, int order)
>  {
> +	bool __maybe_unused movable;
>  #ifdef CONFIG_CMA
>  	/*
>  	 * We shouldn't get here for MIGRATE_CMA if those pages don't
> @@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	if (order > PAGE_ALLOC_COSTLY_ORDER) {
>  		VM_BUG_ON(order != pageblock_order);
> -		return NR_LOWORDER_PCP_LISTS;
> +		movable = migratetype == MIGRATE_MOVABLE;
> +		return NR_LOWORDER_PCP_LISTS + movable;
>  	}
>  #else
>  	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
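>
> A sketch of the resulting mapping (assuming the hunk above; the bool
> movable is used as a 0/1 list offset):
>
>   order_to_pindex(MIGRATE_UNMOVABLE,   pageblock_order) == NR_LOWORDER_PCP_LISTS
>   order_to_pindex(MIGRATE_RECLAIMABLE, pageblock_order) == NR_LOWORDER_PCP_LISTS
>   order_to_pindex(MIGRATE_MOVABLE,     pageblock_order) == NR_LOWORDER_PCP_LISTS + 1
>
> NR_PCP_THP would still need to grow to 2 so that NR_PCP_LISTS covers
> both indexes.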
>
>
>>
>>
>> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
>>  	int order = pindex / MIGRATE_PCPTYPES;
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	if (pindex == NR_LOWORDER_PCP_LISTS)
>> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
>>  		order = HPAGE_PMD_ORDER;
>>  #else
>>  	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
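>>
>> With two THP pindexes, both land past the low-order range, so the
>> reverse mapping can test the computed order instead of one fixed
>> index. A worked check (assuming MIGRATE_PCPTYPES == 3 and
>> PAGE_ALLOC_COSTLY_ORDER == 3, so NR_LOWORDER_PCP_LISTS == 12):
>>
>> 	pindex 12: order = 12 / 3 = 4 > 3 -> HPAGE_PMD_ORDER
>> 	pindex 13: order = 13 / 3 = 4 > 3 -> HPAGE_PMD_ORDER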
>>
>>
>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Commit 1d91df85f399 takes a similar filtering approach, which I
>>>>>>>>> mainly referred to.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> +#else
>>>>>>>>>>>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>>>  				migratetype, alloc_flags);
>>>>>>>>>>>>  		if (likely(page))
>>>>>>>>>>>>  			goto out;
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>  	}
>>>>>>>>>>>>
>>>>>>>>>>>>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Barry
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>