linux-mm.kvack.org archive mirror
From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	<aarcange@redhat.com>, <hughd@google.com>, <mgorman@suse.de>,
	<mhocko@suse.cz>, <cl@gentwo.org>, <zokeefe@google.com>,
	<rientjes@google.com>, Matthew Wilcox <willy@infradead.org>,
	<peterx@redhat.com>,
	"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>,
	"zhangxiaoxu (A)" <zhangxiaoxu5@huawei.com>,
	<kirill.shutemov@linux.intel.com>,
	Lu Jialin <lujialin4@huawei.com>
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
Date: Thu, 1 Dec 2022 10:22:31 +0800
Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
In-Reply-To: <CAHbLzkpENMxuPQdHyehR_kMO8msAbGtHC+N=VD0eE7Nkeo799Q@mail.gmail.com>


On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to track how much physical memory a process uses. Meanwhile,
>>>> the page_counter of a memcg counts how much physical memory a cgroup uses.
>>>> If a cgroup contains only one process, the two look almost the same. But with
>>>> THP enabled, memory.usage_in_bytes in the memcg is sometimes twice the Rss
>>>> in /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0
>>>> [rollup]
>>>> Rss:              500648 kB
>>>> Pss:              498337 kB
>>>> Shared_Clean:       2732 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:       364 kB
>>>> Private_Dirty:    497552 kB
>>>> Referenced:       500648 kB
>>>> Anonymous:        492016 kB
>>>> LazyFree:              0 kB
>>>> AnonHugePages:    129024 kB
>>>> ShmemPmdMapped:        0 kB
>>>> Shared_Hugetlb:        0 kB
>>>> Private_Hugetlb:       0 kB
>>>> Swap:                  0 kB
>>>> SwapPss:               0 kB
>>>> Locked:                0 kB
>>>> THPeligible:    0
>>>>
>>>> I found that the difference arises because __split_huge_pmd decreases
>>>> the mm_counter, but the page_counter in the memcg is not decreased while the
>>>> refcount of the head page is nonzero. The call paths are as follows:
>>>>
>>>> do_madvise
>>>>      madvise_dontneed_free
>>>>        zap_page_range
>>>>          unmap_single_vma
>>>>            zap_pud_range
>>>>              zap_pmd_range
>>>>                __split_huge_pmd
>>>>                  __split_huge_pmd_locked
>>>>                    __mod_lruvec_page_state
>>>>                zap_pte_range
>>>>                   add_mm_rss_vec
>>>>                      add_mm_counter                    -> decrease the mm_counter
>>>>          tlb_finish_mmu
>>>>            arch_tlb_finish_mmu
>>>>              tlb_flush_mmu_free
>>>>                free_pages_and_swap_cache
>>>>                  release_pages
>>>>                    folio_put_testzero(page)            -> not zero, skip
>>>>                      continue;
>>>>                    __folio_put_large
>>>>                      free_transhuge_page
>>>>                        free_compound_page
>>>>                          mem_cgroup_uncharge
>>>>                            page_counter_uncharge        -> decrease the page_counter
>>>>
>>>> The node_page_stat shown in meminfo was also decreased. __split_huge_pmd
>>>> seems to free no physical memory until the whole THP is freed. I am confused
>>>> about which one is the true physical memory usage of a process.
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or memcg limit). So the unmapped subpages are actually not freed
>>> until that point. So the mm counter is decreased due to the zapping,
>>> but the physical pages are not actually freed and uncharged from
>>> the memcg at that time.
>> Thanks!
>>
>> I don't know how much memory a real workload will use, so I just
>> measured the max_usage_in_bytes of the memcg with THP disabled and added a
>> little more for the limit_in_bytes of the memcg with THP enabled, which
>> triggered an OOM (it actually used 100M more with THP enabled). Another
>> test case, for which I know the amount of memory it will use, does not
>> trigger an OOM with a suitable memcg limit, and I see the THP being split
>> when the memory hits the limit.
>>
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean "workingset" used by some 3rd party k8s monitoring tools?
> I recall that depends on what monitoring tools you use, for example,
> some monitoring use active_anon + active_file.

Yes, I noticed that k8s uses a parent pod with a memcg limit set to cover all
child pods, and the workingset monitor watches the root memcg.
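
For illustration, the "active_anon + active_file" heuristic mentioned in the quoted reply can be sketched against the text of a cgroup v1 memory.stat file. This is only a sketch: the function names are hypothetical, and real monitoring tools may compute their working set differently.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines from a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def active_working_set(stats):
    """Working set as active_anon + active_file, in bytes (one heuristic of several)."""
    return stats.get("active_anon", 0) + stats.get("active_file", 0)
```

A monitor would read the text from the memory.stat file of the memcg it watches and pass it to parse_memory_stat() before calling active_working_set().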

>
>> the memory of a workload, but the anon_thp on the deferred list, still charged
>> to the memcg, will make it look higher than it actually is. And it seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without an OOM...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter to kill the container by looking at rss
> solely, you may kill the container prematurely.
Thanks.
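
To show how far an rss-only view can drift from what the memcg charges, the figures posted at the start of this thread can be compared directly. This is a minimal sketch with hypothetical function names. The gap it reports is only an upper bound on memory held by partially unmapped THPs, because page cache and other charges also count toward memory.usage_in_bytes.

```python
def rss_kb(smaps_rollup_text):
    """Return the Rss value (in kB) from the text of /proc/<pid>/smaps_rollup."""
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Rss:"):
            return int(line.split()[1])
    raise ValueError("no Rss line found")


def charged_but_unmapped_kb(usage_bytes, rss_kilobytes):
    """Difference between the memcg charge and the process Rss, in kB."""
    return usage_bytes // 1024 - rss_kilobytes


# Figures taken from the first message of the thread.
rollup = "Rss:              500648 kB\nPss:              498337 kB\n"
gap = charged_but_unmapped_kb(1080930304, rss_kb(rollup))
print(gap)  # 554948 (kB), i.e. the memcg charges about 2.1x the Rss
```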
>
>> Would it be suitable to add the THP deferred split list to meminfo?
> We could, but I can't think of how it would be used to improve the
> use case. Any more thoughts?

In the current k8s scenario, I think the container will not be killed as long
as the parent pod's memcg limit is set correctly.

Maybe meminfo together with a split interface would help users
release memory in advance.
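
For anyone who wants to observe the behaviour discussed in this thread, the partial MADV_DONTNEED pattern can be sketched as below. Assumptions: Linux with THP enabled, Python 3.8 or later (for mmap.madvise), and a 2 MiB huge page size. The helper names are hypothetical, and the mapping is not guaranteed to be 2 MiB aligned, so the zapped ranges may not line up exactly with huge page halves.

```python
import mmap

HUGE = 2 * 1024 * 1024  # assumed PMD-sized THP (x86-64 default)


def second_halves(length, huge=HUGE):
    """Return (offset, size) pairs covering the second half of each huge-sized chunk."""
    return [(off + huge // 2, huge // 2)
            for off in range(0, length - huge + 1, huge)]


def zap_partially(length=64 * HUGE):
    """Fault in an anonymous private mapping, then MADV_DONTNEED half of each chunk."""
    m = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    m.madvise(mmap.MADV_HUGEPAGE)  # ask for THP backing; the kernel may decline
    for off in range(0, length, mmap.PAGESIZE):
        m[off] = 1  # touch every page so it is actually populated
    for off, size in second_halves(length):
        m.madvise(mmap.MADV_DONTNEED, off, size)
    return m  # keep the mapping alive while inspecting the counters
```

After calling zap_partially() inside a test memcg, Rss in smaps_rollup and memory.usage_in_bytes can be compared while the returned mapping is still held.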

>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
>>>>
>>>>
>>> .
> .


Thread overview: 8+ messages
2022-11-26 13:09 Yongqiang Liu
2022-11-28 20:01 ` Yang Shi
2022-11-29  8:10   ` Michal Hocko
2022-11-29 13:19     ` Yongqiang Liu
2022-11-29 17:49     ` Yang Shi
2022-11-29 13:14   ` Yongqiang Liu
2022-11-29 17:23     ` Yang Shi
2022-12-01  2:22       ` Yongqiang Liu [this message]
