From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
<aarcange@redhat.com>, <hughd@google.com>, <mgorman@suse.de>,
<mhocko@suse.cz>, <cl@gentwo.org>, <zokeefe@google.com>,
<rientjes@google.com>, Matthew Wilcox <willy@infradead.org>,
<peterx@redhat.com>,
"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>,
"zhangxiaoxu (A)" <zhangxiaoxu5@huawei.com>,
<kirill.shutemov@linux.intel.com>,
Lu Jialin <lujialin4@huawei.com>
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
Date: Thu, 1 Dec 2022 10:22:31 +0800
Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
In-Reply-To: <CAHbLzkpENMxuPQdHyehR_kMO8msAbGtHC+N=VD0eE7Nkeo799Q@mail.gmail.com>
On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to track how much physical memory a process uses.
>>>> Meanwhile, the page_counter of a memcg counts how much physical
>>>> memory a cgroup uses.
>>>> If a cgroup contains only one process, the two look almost the same.
>>>> But with THP enabled, memory.usage_in_bytes in the memcg is sometimes
>>>> twice the Rss in /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
>>>> Rss: 500648 kB
>>>> Pss: 498337 kB
>>>> Shared_Clean: 2732 kB
>>>> Shared_Dirty: 0 kB
>>>> Private_Clean: 364 kB
>>>> Private_Dirty: 497552 kB
>>>> Referenced: 500648 kB
>>>> Anonymous: 492016 kB
>>>> LazyFree: 0 kB
>>>> AnonHugePages: 129024 kB
>>>> ShmemPmdMapped: 0 kB
>>>> Shared_Hugetlb: 0 kB
>>>> Private_Hugetlb: 0 kB
>>>> Swap: 0 kB
>>>> SwapPss: 0 kB
>>>> Locked: 0 kB
>>>> THPeligible: 0
>>>>
>>>> I found that the difference arises because the __split_huge_pmd path
>>>> decreases the mm_counter, but the page_counter in the memcg is not
>>>> decreased while the refcount of the head page is not zero. The call
>>>> path is as follows:
>>>>
>>>> do_madvise
>>>>   madvise_dontneed_free
>>>>     zap_page_range
>>>>       unmap_single_vma
>>>>         zap_pud_range
>>>>           zap_pmd_range
>>>>             __split_huge_pmd
>>>>               __split_huge_pmd_locked
>>>>                 __mod_lruvec_page_state
>>>>             zap_pte_range
>>>>               add_mm_rss_vec
>>>>                 add_mm_counter  -> decrease the mm_counter
>>>>       tlb_finish_mmu
>>>>         arch_tlb_finish_mmu
>>>>           tlb_flush_mmu_free
>>>>             free_pages_and_swap_cache
>>>>               release_pages
>>>>                 folio_put_testzero(page)  -> not zero, skip
>>>>                   continue;
>>>>                 __folio_put_large
>>>>                   free_transhuge_page
>>>>                     free_compound_page
>>>>                       mem_cgroup_uncharge
>>>>                         page_counter_uncharge  -> decrease the page_counter
>>>> The node_page_stat shown in meminfo is also decreased.
>>>> __split_huge_pmd seems to free no physical memory unless the whole
>>>> THP is freed. I am confused about which one is the true physical
>>>> memory usage of a process.
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or memcg limit). So the unmapped subpages are not actually freed
>>> until that point. The mm counter is decreased by the zapping,
>>> but the physical pages are not yet freed and uncharged from the
>>> memcg.
>> Thanks!
>>
>> I don't know how much memory a real workload will use, so I just measured
>> the max_usage_in_bytes of the memcg with THP disabled and set the
>> limit_in_bytes of the memcg slightly higher with THP enabled, which
>> triggered an OOM (it actually used 100M more with THP enabled). Another
>> test case, for which I know the amount of memory it will use, doesn't
>> trigger an OOM with a suitable memcg limit, and I see the THP being
>> split when memory usage hits the limit.
>>
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean the "workingset" used by some 3rd party k8s monitoring tools?
> I recall that it depends on which monitoring tools you use; for example,
> some monitoring tools use active_anon + active_file.
Yes, I noticed that k8s uses a parent pod that sets a memcg limit covering
all child pods, and the workingset monitor watches the root memcg.
>
>> the memory workload, but the anon_thp on the deferred list, still charged
>> to the memcg, will make it look higher than it actually is. And it seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without an OOM...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter to kill the container by looking at rss
> solely, you may kill the container prematurely.
Thanks.
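
To put a number on how far an rss-only view can drift from the memcg
charge, here is a minimal sketch that compares the two, using the values
quoted at the top of this thread as sample input. The helper names are
hypothetical, and it only parses text rather than reading the real files
under /sys/fs/cgroup and /proc.

```python
def parse_smaps_rollup(text):
    """Return a dict mapping field name -> value in kB for 'Name: N kB' lines."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <int> kB" lines; this skips the address-range header.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def charge_gap_kb(usage_in_bytes, rss_kb):
    """memcg charge minus process Rss, in kB (hypothetical helper)."""
    return usage_in_bytes // 1024 - rss_kb


# Sample input copied from the output quoted earlier in this thread.
SAMPLE_ROLLUP = """\
55ba80600000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
Rss: 500648 kB
Anonymous: 492016 kB
AnonHugePages: 129024 kB
"""
SAMPLE_USAGE = 1080930304  # memory.usage_in_bytes

fields = parse_smaps_rollup(SAMPLE_ROLLUP)
gap = charge_gap_kb(SAMPLE_USAGE, fields["Rss"])
ratio = (SAMPLE_USAGE / 1024) / fields["Rss"]
print(f"usage={SAMPLE_USAGE // 1024} kB rss={fields['Rss']} kB "
      f"gap={gap} kB ratio={ratio:.2f}")
```

With these numbers the gap is 554948 kB. Not all of it is necessarily
deferred-split THP, since the memcg charge can also include page cache
and other charged memory, so this is only an upper bound on that effect.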
>
>> Is it suitable to add meminfo of a deferred split list of THP?
> We could, but I can't think of how it would be used to improve the
> use case. Any more thoughts?
In the current k8s scenario, I think the container will not be killed as
long as the parent pod's memcg limit is set correctly.
Maybe the meminfo, together with a split interface, would help users
release memory in advance.
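
As a rough illustration of what user space can already observe: /proc/vmstat
exposes cumulative THP event counters such as thp_deferred_split_page,
thp_split_pmd and thp_split_page. They count events, not the current size
of the deferred split list, so the sketch below only shows how a monitor
might watch their deltas. The snapshot values and helper names are made up
for illustration.

```python
THP_KEYS = ("thp_deferred_split_page", "thp_split_pmd", "thp_split_page")


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def thp_delta(before, after):
    """Per-counter increase between two snapshots (hypothetical helper)."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in THP_KEYS}


# Made-up sample snapshots, not measurements from this thread.
BEFORE = ("nr_free_pages 100000\nthp_deferred_split_page 10\n"
          "thp_split_pmd 12\nthp_split_page 3\n")
AFTER = ("nr_free_pages 90000\nthp_deferred_split_page 74\n"
         "thp_split_pmd 76\nthp_split_page 3\n")

delta = thp_delta(parse_vmstat(BEFORE), parse_vmstat(AFTER))
print(delta)
```

On a real system the snapshots would come from reading /proc/vmstat twice.
A growing thp_deferred_split_page with a flat thp_split_page would suggest
that THPs are being queued for split but not yet split.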
>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
>>>>
>>>>