* [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu @ 2022-11-26 13:09 UTC
To: linux-kernel, linux-mm
Cc: akpm, aarcange, hughd, mgorman, mhocko, cl, n-horiguchi, zokeefe,
    rientjes, Matthew Wilcox, peterx, Wangkefeng (OS Kernel Lab),
    zhangxiaoxu (A), kirill.shutemov, Yongqiang Liu, Lu Jialin

Hi,

We use mm_counter to track how much physical memory a process uses.
Meanwhile, the page_counter of a memcg counts how much physical memory
a cgroup uses. If a cgroup contains only one process, the two look
almost the same. But with THP enabled, memory.usage_in_bytes in the
memcg is sometimes twice the Rss in /proc/[pid]/smaps_rollup, or more,
as follows:

[root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
1080930304
[root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
1290
[root@localhost sda]# cat /proc/1290/smaps_rollup
55ba80600000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
Rss:              500648 kB
Pss:              498337 kB
Shared_Clean:       2732 kB
Shared_Dirty:          0 kB
Private_Clean:       364 kB
Private_Dirty:    497552 kB
Referenced:       500648 kB
Anonymous:        492016 kB
LazyFree:              0 kB
AnonHugePages:    129024 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0

I found that the difference arises because __split_huge_pmd decreases
the mm_counter, but the page_counter in the memcg is not decreased
while the refcount of the head page is not zero. Here is the call flow:

do_madvise
  madvise_dontneed_free
    zap_page_range
      unmap_single_vma
        zap_pud_range
          zap_pmd_range
            __split_huge_pmd
              __split_huge_pmd_locked
                __mod_lruvec_page_state
            zap_pte_range
              add_mm_rss_vec
                add_mm_counter          -> decrease the mm_counter
      tlb_finish_mmu
        arch_tlb_finish_mmu
          tlb_flush_mmu_free
            free_pages_and_swap_cache
              release_pages
                folio_put_testzero(page) -> not zero, skip
                  continue;
                __folio_put_large
                  free_transhuge_page
                    free_compound_page
                      mem_cgroup_uncharge
                        page_counter_uncharge -> decrease the page_counter

node_page_stat, which is shown in meminfo, was also decreased.
__split_huge_pmd seems to free no physical memory unless the whole THP
is freed. I am confused about which one is the true physical memory
usage of a process.

Kind regards,

Yongqiang Liu
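The gap reported above can be quantified from the two readings. The following is a minimal sketch, assuming only the values quoted in the report; the function names are hypothetical helpers and not part of any kernel or cgroup tooling.

```python
def rss_bytes_from_smaps_rollup(text):
    """Return the Rss value of a smaps_rollup dump, converted from kB to bytes."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Rss:":
            return int(fields[1]) * 1024
    raise ValueError("no Rss line found")


def charge_to_rss_ratio(usage_in_bytes, smaps_rollup_text):
    """Ratio of the memcg charge to the RSS of the single process in the cgroup."""
    return usage_in_bytes / rss_bytes_from_smaps_rollup(smaps_rollup_text)


# Values copied from the report (only the lines needed here).
SMAPS_ROLLUP = """\
Rss:              500648 kB
Pss:              498337 kB
Anonymous:        492016 kB
AnonHugePages:    129024 kB
"""
USAGE_IN_BYTES = 1080930304

ratio = charge_to_rss_ratio(USAGE_IN_BYTES, SMAPS_ROLLUP)
print(f"memcg charge is {ratio:.2f}x the process RSS")
```

On a live system the two inputs would be read from memory.usage_in_bytes and /proc/[pid]/smaps_rollup at roughly the same time, since both change as the workload runs.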
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yang Shi @ 2022-11-28 20:01 UTC
To: Yongqiang Liu
Cc: linux-kernel, linux-mm, akpm, aarcange, hughd, mgorman, mhocko, cl,
    n-horiguchi, zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>
> Hi,
>
> We use mm_counter to track how much physical memory a process uses.
> Meanwhile, the page_counter of a memcg counts how much physical
> memory a cgroup uses. If a cgroup contains only one process, the two
> look almost the same. But with THP enabled, memory.usage_in_bytes in
> the memcg is sometimes twice the Rss in /proc/[pid]/smaps_rollup, or
> more, as follows:

[...]

> I found that the difference arises because __split_huge_pmd decreases
> the mm_counter, but the page_counter in the memcg is not decreased
> while the refcount of the head page is not zero. Here is the call
> flow:

[...]

> node_page_stat, which is shown in meminfo, was also decreased.
> __split_huge_pmd seems to free no physical memory unless the whole
> THP is freed. I am confused about which one is the true physical
> memory usage of a process.

This should be caused by the deferred split of THP. When MADV_DONTNEED
is called on part of the mapping, the huge PMD is split, but the THP
itself is not split until memory pressure is hit (global or memcg
limit). So the unmapped subpages are not actually freed until that
point. The mm counter is therefore decreased by the zapping, but the
physical pages are not yet freed and uncharged from the memcg.
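The deferred-split behaviour described above can be illustrated with a toy accounting model. This is only a sketch under stated assumptions, not kernel code: ToyThp and ToyMemcg are hypothetical names, a huge page is assumed to hold 512 base pages (4 KiB pages, 2 MiB THP), a fully unmapped THP is treated as freed at once, and reclaim is assumed to split every partially mapped THP.

```python
from dataclasses import dataclass

SUBPAGES_PER_THP = 512  # assumption: 4 KiB base pages, 2 MiB huge pages


@dataclass
class ToyThp:
    """One anonymous THP; tracks how many of its subpages are still mapped."""
    mapped: int = SUBPAGES_PER_THP
    split: bool = False

    def charged(self):
        # A fully unmapped THP is freed and uncharged as a whole.
        if self.mapped == 0:
            return 0
        # Until the compound page itself is split, every subpage stays charged.
        return self.mapped if self.split else SUBPAGES_PER_THP


class ToyMemcg:
    def __init__(self, nr_thp):
        self.thps = [ToyThp() for _ in range(nr_thp)]

    def rss_pages(self):
        """Pages still mapped: what the mm counter would report."""
        return sum(t.mapped for t in self.thps)

    def charged_pages(self):
        """Pages still charged: what the memcg page counter would report."""
        return sum(t.charged() for t in self.thps)

    def madvise_dontneed(self, subpages_per_thp):
        """Zap some subpages of every THP; only the PMD mapping is split."""
        for t in self.thps:
            t.mapped = max(0, t.mapped - subpages_per_thp)

    def reclaim(self):
        """Model memory pressure: split THPs that are only partially mapped."""
        for t in self.thps:
            if 0 < t.mapped < SUBPAGES_PER_THP:
                t.split = True


cg = ToyMemcg(nr_thp=4)
cg.madvise_dontneed(256)          # drop half of each huge page
before = (cg.rss_pages(), cg.charged_pages())
cg.reclaim()                      # pressure triggers the deferred split
after = (cg.rss_pages(), cg.charged_pages())
print(before, after)
```

In this model the charge is twice the RSS after the partial MADV_DONTNEED and only converges once reclaim runs, which matches the shape of the numbers in the original report.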
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Michal Hocko @ 2022-11-29 8:10 UTC
To: Yang Shi
Cc: Yongqiang Liu, linux-kernel, linux-mm, akpm, aarcange, hughd,
    mgorman, cl, n-horiguchi, zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On Mon 28-11-22 12:01:37, Yang Shi wrote:
> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
> >
> > Hi,
> >
> > We use mm_counter to track how much physical memory a process uses.
> > Meanwhile, the page_counter of a memcg counts how much physical
> > memory a cgroup uses. If a cgroup contains only one process, the
> > two look almost the same. But with THP enabled,
> > memory.usage_in_bytes in the memcg is sometimes twice the Rss in
> > /proc/[pid]/smaps_rollup, or more, as follows:
[...]
> > node_page_stat, which is shown in meminfo, was also decreased.
> > __split_huge_pmd seems to free no physical memory unless the whole
> > THP is freed. I am confused about which one is the true physical
> > memory usage of a process.
>
> This should be caused by the deferred split of THP. When MADV_DONTNEED
> is called on part of the mapping, the huge PMD is split, but the THP
> itself is not split until memory pressure is hit (global or memcg
> limit). So the unmapped subpages are not actually freed until that
> point. The mm counter is therefore decreased by the zapping, but the
> physical pages are not yet freed and uncharged from the memcg.

Yes, and this is not really bound to THP. Consider page cache. It can
be accessed via syscalls, in which case it does not correspond to rss
at all while it is still charged to a memcg. Or it can be mapped and
later unmapped, so it disappears from rss while it stays charged until
it gets reclaimed under memory pressure. Or it can be an in-memory
object that is not bound to any process lifetime (e.g. tmpfs). Or it
can be kernel memory charged to a memcg, which is not covered by rss
because it is either not mapped or unknown to the rss counters.
-- 
Michal Hocko
SUSE Labs
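One of the cases listed above, page cache that is charged but not mapped by any process, can be estimated from the memcg's own statistics. The sketch below assumes the cgroup v1 memory.stat fields "cache" and "mapped_file"; the helper names are hypothetical and the sample numbers are invented for illustration, not taken from the thread. Treat the result as a rough estimate only.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines from a cgroup v1 memory.stat dump into a dict."""
    stat = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stat[name] = int(value)
    return stat


def unmapped_cache_bytes(stat):
    """Page cache charged to the memcg that no process currently maps."""
    return stat["cache"] - stat["mapped_file"]


# Hypothetical sample values, for illustration only.
SAMPLE_MEMORY_STAT = """\
cache 209715200
rss 104857600
mapped_file 20971520
shmem 10485760
"""

stat = parse_memory_stat(SAMPLE_MEMORY_STAT)
print(unmapped_cache_bytes(stat))
```

Memory in this bucket is charged to the cgroup yet never shows up in any process's Rss, so it contributes to the same kind of gap as the deferred-split THP case.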
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu @ 2022-11-29 13:19 UTC
To: Michal Hocko, Yang Shi
Cc: linux-kernel, linux-mm, akpm, aarcange, hughd, mgorman, cl,
    n-horiguchi, zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On 2022/11/29 16:10, Michal Hocko wrote:
> On Mon 28-11-22 12:01:37, Yang Shi wrote:
[...]
>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>> is called on part of the mapping, the huge PMD is split, but the THP
>> itself is not split until memory pressure is hit (global or memcg
>> limit). So the unmapped subpages are not actually freed until that
>> point. The mm counter is therefore decreased by the zapping, but the
>> physical pages are not yet freed and uncharged from the memcg.
> Yes, and this is not really bound to THP. Consider page cache. It can
> be accessed via syscalls, in which case it does not correspond to rss
> at all while it is still charged to a memcg. Or it can be mapped and
> later unmapped, so it disappears from rss while it stays charged until
> it gets reclaimed under memory pressure. Or it can be an in-memory
> object that is not bound to any process lifetime (e.g. tmpfs). Or it
> can be kernel memory charged to a memcg, which is not covered by rss
> because it is either not mapped or unknown to the rss counters.

Thanks! This is very helpful to me.
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yang Shi @ 2022-11-29 17:49 UTC
To: Michal Hocko
Cc: Yongqiang Liu, linux-kernel, linux-mm, akpm, aarcange, hughd,
    mgorman, cl, zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On Tue, Nov 29, 2022 at 12:10 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 28-11-22 12:01:37, Yang Shi wrote:
[...]
> > This should be caused by the deferred split of THP. When MADV_DONTNEED
> > is called on part of the mapping, the huge PMD is split, but the THP
> > itself is not split until memory pressure is hit (global or memcg
> > limit). So the unmapped subpages are not actually freed until that
> > point. The mm counter is therefore decreased by the zapping, but the
> > physical pages are not yet freed and uncharged from the memcg.
>
> Yes, and this is not really bound to THP. Consider page cache. It can
> be accessed via syscalls, in which case it does not correspond to rss
> at all while it is still charged to a memcg. Or it can be mapped and
> later unmapped, so it disappears from rss while it stays charged until
> it gets reclaimed under memory pressure. Or it can be an in-memory
> object that is not bound to any process lifetime (e.g. tmpfs). Or it
> can be kernel memory charged to a memcg, which is not covered by rss
> because it is either not mapped or unknown to the rss counters.

Yes, good points. Thanks, Michal. One more thing worth mentioning is
that the RSS shown by ps or smaps is different from the RSS shown by
memcg.

> --
> Michal Hocko
> SUSE Labs
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu @ 2022-11-29 13:14 UTC
To: Yang Shi
Cc: linux-kernel, linux-mm, akpm, aarcange, hughd, mgorman, mhocko, cl,
    n-horiguchi, zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On 2022/11/29 4:01, Yang Shi wrote:
> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
[...]
>> node_page_stat, which is shown in meminfo, was also decreased.
>> __split_huge_pmd seems to free no physical memory unless the whole
>> THP is freed. I am confused about which one is the true physical
>> memory usage of a process.
> This should be caused by the deferred split of THP. When MADV_DONTNEED
> is called on part of the mapping, the huge PMD is split, but the THP
> itself is not split until memory pressure is hit (global or memcg
> limit). So the unmapped subpages are not actually freed until that
> point. The mm counter is therefore decreased by the zapping, but the
> physical pages are not yet freed and uncharged from the memcg.

Thanks!

I don't know how much memory a real workload will use. So I measured
max_usage_in_bytes of the memcg with THP disabled and added a little
more to get the limit_in_bytes of the memcg with THP enabled, which
triggered an OOM... (it actually used 100M more with THP enabled).
Another test case, for which I know how much memory it uses, does not
trigger an OOM with a suitable memcg limit, and I see the THPs being
split when the memory hits the limit.

I have another concern: k8s usually uses (rss - files) to estimate
the memory workload, but the anon_thp on the deferred list, which is
charged to the memcg, makes it look higher than it actually is. And it
seems the container will be killed without an OOM...

Would it be suitable to add a meminfo entry for the THP deferred split
list?
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yang Shi @ 2022-11-29 17:23 UTC
To: Yongqiang Liu
Cc: linux-kernel, linux-mm, akpm, aarcange, hughd, mgorman, mhocko, cl,
    zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
[...]
> I have another concern: k8s usually uses (rss - files) to estimate

Do you mean the "workingset" used by some 3rd party k8s monitoring
tools? I recall that it depends on which monitoring tools you use; for
example, some monitoring tools use active_anon + active_file.

> the memory workload, but the anon_thp on the deferred list, which is
> charged to the memcg, makes it look higher than it actually is. And
> it seems the

Yes, but the deferred split shrinker should handle this quite
gracefully.

> container will be killed without an OOM...

If you have some userspace daemons which monitor memory usage by rss
and try to be smart by killing the container based on rss alone, you
may kill the container prematurely.

> Would it be suitable to add a meminfo entry for the THP deferred
> split list?

We could, but I can't think of how it would be used to improve the use
case. Any more thoughts?
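The working-set estimate mentioned above (active_anon + active_file) is straightforward to compute once the memcg statistics are in hand. This is a minimal sketch with hypothetical helper names and invented numbers; which formula a given monitoring tool actually uses is not stated in the thread and would need to be checked against that tool.

```python
MIB = 2**20


def working_set_bytes(stat):
    """Working-set estimate named in the thread: active_anon + active_file."""
    return stat["active_anon"] + stat["active_file"]


def headroom_bytes(stat, limit_in_bytes):
    """Distance between the working-set estimate and the memcg limit."""
    return limit_in_bytes - working_set_bytes(stat)


# Hypothetical memcg counters, for illustration only.
stat = {
    "active_anon": 300 * MIB,
    "inactive_anon": 50 * MIB,
    "active_file": 40 * MIB,
    "inactive_file": 110 * MIB,
}
limit = 512 * MIB

print(working_set_bytes(stat) // MIB, headroom_bytes(stat, limit) // MIB)
```

A daemon that acts on a single counter such as rss, rather than on what reclaim could still recover, risks the premature kill described above.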
* Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu @ 2022-12-01 2:22 UTC
To: Yang Shi
Cc: linux-kernel, linux-mm, akpm, aarcange, hughd, mgorman, mhocko, cl,
    zokeefe, rientjes, Matthew Wilcox, peterx,
    Wangkefeng (OS Kernel Lab), zhangxiaoxu (A), kirill.shutemov,
    Lu Jialin

On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
[...]
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean the "workingset" used by some 3rd party k8s monitoring
> tools? I recall that it depends on which monitoring tools you use; for
> example, some monitoring tools use active_anon + active_file.

Yes, I noticed that k8s uses a parent pod which sets a memcg limit
covering all child pods, and the workingset monitor watches the root
memcg.

>> the memory workload, but the anon_thp on the deferred list, which is
>> charged to the memcg, makes it look higher than it actually is. And
>> it seems the
> Yes, but the deferred split shrinker should handle this quite
> gracefully.
>> container will be killed without an OOM...
> If you have some userspace daemons which monitor memory usage by rss
> and try to be smart by killing the container based on rss alone, you
> may kill the container prematurely.

Thanks.

>> Would it be suitable to add a meminfo entry for the THP deferred
>> split list?
> We could, but I can't think of how it would be used to improve the use
> case. Any more thoughts?

In the current k8s scenario, I think the container will not be killed
as long as the parent pod memcg limit is set correctly. Maybe a meminfo
entry together with a split interface would help users release memory
in advance.
end of thread

Thread overview: 8+ messages
2022-11-26 13:09 [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled Yongqiang Liu
2022-11-28 20:01 ` Yang Shi
2022-11-29  8:10   ` Michal Hocko
2022-11-29 13:19     ` Yongqiang Liu
2022-11-29 17:49     ` Yang Shi
2022-11-29 13:14   ` Yongqiang Liu
2022-11-29 17:23     ` Yang Shi
2022-12-01  2:22       ` Yongqiang Liu