linux-mm.kvack.org archive mirror
From: Vlastimil Babka <vbabka@suse.cz>
To: Michal Hocko <mhocko@kernel.org>,
	kirill.shutemov@linux.intel.com,
	Yang Shi <yang.shi@linux.alibaba.com>
Cc: hannes@cmpxchg.org, rientjes@google.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable
Date: Thu, 22 Aug 2019 14:56:56 +0200	[thread overview]
Message-ID: <ee048bbf-3563-d695-ea58-5f1504aee35c@suse.cz> (raw)
In-Reply-To: <20190822080434.GF12785@dhcp22.suse.cz>

On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
> 
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is simple metric
> that is attractive to use and works in many cases though.
> 
>> Currently, the deferred split THPs are not accounted into
>> available memory, but they are reclaimable actually, like reclaimable
>> slabs.
>> 
>> And they seem very common with typical workloads when THP is enabled.
>> A simple run of the MariaDB test from mmtests with THP set to "always"
>> shows it can generate over fifteen thousand deferred split THPs
>> (accumulating around 30G in a one-hour run, 75% of my VM's 40G of
>> memory).  It looks worth accounting them in MemAvailable.
> 
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They are essentially blocking any higher order
> allocations and also push the system towards more memory pressure.
> 
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed, that's a good question, one for Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs in
>> the second tail page, and account it into KReclaimable.  Although THP
>> allocations are not exactly "kernel allocations", once they are
>> unmapped they are in fact kernel-only.  KReclaimable is already
>> accounted into MemAvailable.
> 
> This sounds reasonable to me.
>  
>> When the deferred split THPs get split due to memory pressure or freed,
>> just decrease by the recorded number.
>> 
>> With this change, when running a program that populates a 1G address
>> space and then calls madvise(MADV_DONTNEED) on 511 of the 512 pages of
>> every THP, /proc/meminfo shows the deferred split THPs accounted
>> properly.
>> 
>> Populated, before calling madvise(MADV_DONTNEED):
>> MemAvailable:   43531960 kB
>> AnonPages:       1096660 kB
>> KReclaimable:      26156 kB
>> AnonHugePages:   1056768 kB
>> 
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable:   44411164 kB
>> AnonPages:         50140 kB
>> KReclaimable:    1070640 kB
>> AnonHugePages:     10240 kB
>> 
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: David Rientjes <rientjes@google.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...

>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>  
>>  	INIT_LIST_HEAD(page_deferred_list(page));
>>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +	page[2].nr_freeable = 0;
>>  }
>>  
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  		if (!list_empty(page_deferred_list(head))) {
>>  			ds_queue->split_queue_len--;
>>  			list_del(page_deferred_list(head));
>> +			__mod_node_page_state(page_pgdat(page),
>> +					NR_KERNEL_MISC_RECLAIMABLE,
>> +					-head[2].nr_freeable);
>> +			head[2].nr_freeable = 0;
>>  		}
>>  		if (mapping)
>>  			__dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(page));
>>  	}
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      -page[2].nr_freeable);
>> +	page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page to the
deferred list? So here the code would go in the if (!list_empty()) { } block above.

>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }
>>  
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  		return;
>>  
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +	page[2].nr_freeable += nr;
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      nr);

Same here: only do this when actually adding to the list, below? Otherwise we
might account base pages multiple times.

>>  	if (list_empty(page_deferred_list(page))) {
>>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>  		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>  
>>  	if (nr) {
>>  		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -		deferred_split_huge_page(page);
>> +		deferred_split_huge_page(page, nr);
>>  	}
>>  }
>>  
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>  		clear_page_mlock(page);
>>  
>>  	if (PageTransCompound(page))
>> -		deferred_split_huge_page(compound_head(page));
>> +		deferred_split_huge_page(compound_head(page), 1);
>>  
>>  	/*
>>  	 * It would be tidy to reset the PageAnon mapping here,
>> -- 
>> 1.8.3.1
> 



Thread overview: 29+ messages
2019-08-21 17:55 Yang Shi
2019-08-22  8:04 ` Michal Hocko
2019-08-22 12:56   ` Vlastimil Babka [this message]
2019-08-22 15:29     ` Kirill A. Shutemov
2019-08-26  7:40       ` Michal Hocko
2019-08-26 13:15         ` Kirill A. Shutemov
2019-08-27  6:01           ` Michal Hocko
2019-08-27 11:02             ` Kirill A. Shutemov
2019-08-27 11:48               ` Michal Hocko
2019-08-27 12:01               ` Vlastimil Babka
2019-08-27 12:09                 ` Michal Hocko
2019-08-27 12:17                   ` Kirill A. Shutemov
2019-08-27 12:59                     ` Kirill A. Shutemov
2019-08-27 17:06                       ` Yang Shi
2019-08-28  7:57                         ` Michal Hocko
2019-08-28 14:03                           ` Kirill A. Shutemov
     [not found]                             ` <20190828141253.GM28313@dhcp22.suse.cz>
2019-08-28 14:46                               ` Kirill A. Shutemov
2019-08-28 16:02                                 ` Michal Hocko
2019-08-29 17:03                                   ` Yang Shi
2019-08-30  6:23                                     ` Michal Hocko
2019-08-30 12:53                                   ` Kirill A. Shutemov
2019-08-22 15:49     ` Kirill A. Shutemov
2019-08-22 15:57     ` Yang Shi
2019-08-22 15:33   ` Yang Shi
2019-08-26  7:43     ` Michal Hocko
2019-08-27  4:27       ` Yang Shi
2019-08-27  5:59         ` Michal Hocko
2019-08-27  8:32           ` Kirill A. Shutemov
2019-08-27  9:00             ` Michal Hocko
