From: Yang Shi <shy828301@gmail.com>
To: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	aarcange@redhat.com, hughd@google.com,  mgorman@suse.de,
	mhocko@suse.cz, cl@gentwo.org, zokeefe@google.com,
	 rientjes@google.com, Matthew Wilcox <willy@infradead.org>,
	peterx@redhat.com,
	 "Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>,
	"zhangxiaoxu (A)" <zhangxiaoxu5@huawei.com>,
	 kirill.shutemov@linux.intel.com,
	Lu Jialin <lujialin4@huawei.com>
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
Date: Tue, 29 Nov 2022 09:23:08 -0800	[thread overview]
Message-ID: <CAHbLzkpENMxuPQdHyehR_kMO8msAbGtHC+N=VD0eE7Nkeo799Q@mail.gmail.com> (raw)
In-Reply-To: <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>

On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>
>
> On 2022/11/29 4:01, Yang Shi wrote:
> > On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
> >> Hi,
> >>
> >> We use mm_counter to count how much physical memory a process uses.
> >> Meanwhile, the page_counter of a memcg counts how much physical
> >> memory a cgroup uses. If a cgroup contains only one process, the two
> >> look almost the same. But with THP enabled, memory.usage_in_bytes in
> >> the memcg can sometimes be twice or more the Rss reported in
> >> /proc/[pid]/smaps_rollup, as follows:
> >>
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
> >> 1080930304
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
> >> 1290
> >> [root@localhost sda]# cat /proc/1290/smaps_rollup
> >> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0
> >> [rollup]
> >> Rss:              500648 kB
> >> Pss:              498337 kB
> >> Shared_Clean:       2732 kB
> >> Shared_Dirty:          0 kB
> >> Private_Clean:       364 kB
> >> Private_Dirty:    497552 kB
> >> Referenced:       500648 kB
> >> Anonymous:        492016 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:    129024 kB
> >> ShmemPmdMapped:        0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >> THPeligible:    0
> >>
> >> I have found that the difference is because __split_huge_pmd
> >> decreases the mm_counter, but the page_counter in the memcg is not
> >> decreased as long as the refcount of the head page is not zero. Here
> >> is the call path:
> >>
> >> do_madvise
> >>     madvise_dontneed_free
> >>       zap_page_range
> >>         unmap_single_vma
> >>           zap_pud_range
> >>             zap_pmd_range
> >>               __split_huge_pmd
> >>                 __split_huge_pmd_locked
> >>                   __mod_lruvec_page_state
> >>               zap_pte_range
> >>                  add_mm_rss_vec
> >>                     add_mm_counter              -> decrease the mm_counter
> >>         tlb_finish_mmu
> >>           arch_tlb_finish_mmu
> >>             tlb_flush_mmu_free
> >>               free_pages_and_swap_cache
> >>                 release_pages
> >>                   folio_put_testzero(page)            -> not zero, skip
> >>                     continue;
> >>                   __folio_put_large
> >>                     free_transhuge_page
> >>                       free_compound_page
> >>                         mem_cgroup_uncharge
> >>                           page_counter_uncharge -> decrease the page_counter
> >>
> >> node_page_state, which shows up in meminfo, is also decreased. So
> >> __split_huge_pmd seems to free no physical memory unless the whole
> >> THP is freed. I am confused about which one reflects the true
> >> physical memory usage of a process.
> > This should be caused by the deferred split of THP. When MADV_DONTNEED
> > is called on part of the mapping, the huge PMD is split, but the THP
> > itself will not be split until memory pressure is hit (global or the
> > memcg limit). So the unmapped subpages are not actually freed until
> > that point: the mm counter is decreased by the zapping, but the
> > physical pages are not yet freed and therefore not uncharged from
> > the memcg.
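
(For anyone who wants to see this locally: below is a minimal
reproducer sketch, not from this thread. It assumes THP is enabled and
the process runs alone in a memcg; the sizes and the MADV_HUGEPAGE hint
are illustrative. After the loop, Rss in smaps_rollup drops by roughly
half while memory.usage_in_bytes stays put until the deferred split
shrinker runs.)

/* Reproducer sketch: fault in THP-backed memory, then zap half of
 * each huge page with MADV_DONTNEED. The PMDs are split and the mm
 * counters drop, but the compound pages only go on the deferred
 * split queue and remain charged to the memcg. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)  /* assuming 2MB THPs */
#define NR_HPAGES  256UL        /* 512MB in total */

int main(void)
{
	size_t len = NR_HPAGES * HPAGE_SIZE;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(buf, len, MADV_HUGEPAGE);  /* ask for THPs */
	memset(buf, 1, len);               /* fault everything in */

	/* Zap the second half of every huge page. */
	for (size_t i = 0; i < NR_HPAGES; i++)
		madvise(buf + i * HPAGE_SIZE + HPAGE_SIZE / 2,
			HPAGE_SIZE / 2, MADV_DONTNEED);

	printf("check /proc/%d/smaps_rollup vs memory.usage_in_bytes\n",
	       getpid());
	pause();  /* keep the mapping alive for inspection */
	return 0;
}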
>
> Thanks!
>
> I don't know how much memory a real workload will cost. So I just
> tested the max_usage_in_bytes of the memcg with THP disabled and added
> a little bit more for the limit_in_bytes of the memcg with THP
> enabled, which triggered an OOM... (it actually cost 100M more with
> THP enabled). Another test case, where I knew the amount of memory it
> would cost, did not trigger an OOM with a suitable memcg limit, and I
> saw the THP split when the memory hit the limit.
>
>
> I have another concern: k8s usually uses (rss - files) to estimate

Do you mean the "workingset" used by some 3rd-party k8s monitoring
tools? I recall it depends on which monitoring tool you use; for
example, some tools use active_anon + active_file.
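
(For reference, a rough sketch of that kind of "working set"
computation, assuming cgroup v1 paths and that the tool subtracts
inactive file cache from usage the way cadvisor-style monitors do; the
cgroup name "test" is just an example:)

/* Working-set sketch: usage_in_bytes minus total_inactive_file.
 * Partially-zapped THPs sitting on the deferred split list are
 * still charged as anon, so this can overestimate the real need. */
#include <stdio.h>
#include <string.h>

static unsigned long long read_ull(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		fscanf(f, "%llu", &val);
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long long usage =
		read_ull("/sys/fs/cgroup/memory/test/memory.usage_in_bytes");
	unsigned long long inactive_file = 0, v;
	char key[64];
	FILE *f = fopen("/sys/fs/cgroup/memory/test/memory.stat", "r");

	if (!f) {
		perror("memory.stat");
		return 1;
	}
	/* memory.stat is strictly "key value" pairs, one per line. */
	while (fscanf(f, "%63s %llu", key, &v) == 2) {
		if (!strcmp(key, "total_inactive_file"))
			inactive_file = v;
	}
	fclose(f);
	printf("working set: %llu bytes\n", usage - inactive_file);
	return 0;
}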

>
> the memory workload, but the anon_thp charged to the memcg while
> sitting on the deferred split list will make it look higher than
> actual. And it seems the

Yes, but the deferred split shrinker should handle this quite gracefully.
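
(One way to watch it handle this, sketched below under the same
assumptions as above, cgroup v1 and a group named "test": tighten the
limit below the current usage and the reclaim path has to split the
queued THPs, after which usage_in_bytes drops. The 600MB value is
arbitrary; if even splitting cannot get usage below the new limit, the
write fails with EBUSY.)

/* Sketch: lowering the memcg limit triggers reclaim, which runs the
 * deferred split shrinker and frees the already-unmapped subpages. */
#include <stdio.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/memory/test";
	unsigned long long usage;
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cg);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%llu", 600ULL << 20);  /* 600MB, assumed */
	fclose(f);

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cg);
	f = fopen(path, "r");
	if (f && fscanf(f, "%llu", &usage) == 1)
		printf("usage after shrink: %llu\n", usage);
	if (f)
		fclose(f);
	return 0;
}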

>
> container will be killed without oom...

If you have some userspace daemon which monitors memory usage by rss
and tries to be clever about killing the container by looking at rss
alone, you may kill the container prematurely.

>
> Would it be suitable to add the deferred split list of THPs to meminfo?

We could, but I can't think of how it would be used to improve this
use case. Any more thoughts?
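
(FWIW, /proc/vmstat already exposes cumulative event counters for
this, e.g. thp_deferred_split_page and thp_split_page, though those
count events rather than the pages currently sitting on the queue,
which is what a meminfo entry would add. A quick sketch to watch them:)

/* Sketch: print the THP split event counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned long long v;
	char key[64];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* /proc/vmstat is "key value" pairs, one per line. */
	while (fscanf(f, "%63s %llu", key, &v) == 2) {
		if (!strcmp(key, "thp_deferred_split_page") ||
		    !strcmp(key, "thp_split_pmd") ||
		    !strcmp(key, "thp_split_page"))
			printf("%s %llu\n", key, v);
	}
	fclose(f);
	return 0;
}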

>
> >>
> >> Kind regards,
> >>
> >> Yongqiang Liu
> >>
> >>



Thread overview: 8+ messages
2022-11-26 13:09 Yongqiang Liu
2022-11-28 20:01 ` Yang Shi
2022-11-29  8:10   ` Michal Hocko
2022-11-29 13:19     ` Yongqiang Liu
2022-11-29 17:49     ` Yang Shi
2022-11-29 13:14   ` Yongqiang Liu
2022-11-29 17:23     ` Yang Shi [this message]
2022-12-01  2:22       ` Yongqiang Liu
