From: Yang Shi
Date: Tue, 29 Nov 2022 09:23:08 -0800
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
To: Yongqiang Liu
Cc: "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", "akpm@linux-foundation.org", aarcange@redhat.com, hughd@google.com, mgorman@suse.de, mhocko@suse.cz, cl@gentwo.org, zokeefe@google.com, rientjes@google.com, Matthew Wilcox, peterx@redhat.com, "Wangkefeng (OS Kernel Lab)", "zhangxiaoxu (A)", kirill.shutemov@linux.intel.com, Lu Jialin
References: <8a2f2644-71d0-05d7-49d8-878aafa99652@huawei.com> <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>
In-Reply-To: <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>

On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu wrote:
>
>
> On 2022/11/29 4:01, Yang Shi wrote:
> > On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu wrote:
> >> Hi,
> >>
> >> We use mm_counter to track how much physical memory a process uses.
> >> Meanwhile, page_counter of a memcg is used to count how much physical
> >> memory a cgroup uses.
> >> If a cgroup contains only one process, the two look almost the same. But
> >> with THP enabled, memory.usage_in_bytes in the memcg may sometimes be
> >> twice the rss in /proc/[pid]/smaps_rollup or more, as follows:
> >>
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
> >> 1080930304
> >> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
> >> 1290
> >> [root@localhost sda]# cat /proc/1290/smaps_rollup
> >> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0    [rollup]
> >> Rss:              500648 kB
> >> Pss:              498337 kB
> >> Shared_Clean:       2732 kB
> >> Shared_Dirty:          0 kB
> >> Private_Clean:       364 kB
> >> Private_Dirty:    497552 kB
> >> Referenced:       500648 kB
> >> Anonymous:        492016 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:    129024 kB
> >> ShmemPmdMapped:        0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >> THPeligible:    0
> >>
> >> I found that the difference arises because __split_huge_pmd decreases
> >> the mm_counter, but the page_counter in the memcg is not decreased while
> >> the refcount of the head page is not zero. The call paths are as follows:
> >>
> >> do_madvise
> >>   madvise_dontneed_free
> >>     zap_page_range
> >>       unmap_single_vma
> >>         zap_pud_range
> >>           zap_pmd_range
> >>             __split_huge_pmd
> >>               __split_huge_pmd_locked
> >>                 __mod_lruvec_page_state
> >>             zap_pte_range
> >>               add_mm_rss_vec
> >>                 add_mm_counter              -> decrease the mm_counter
> >>       tlb_finish_mmu
> >>         arch_tlb_finish_mmu
> >>           tlb_flush_mmu_free
> >>             free_pages_and_swap_cache
> >>               release_pages
> >>                 folio_put_testzero(page)    -> not zero, skip
> >>                   continue;
> >>                 __folio_put_large
> >>                   free_transhuge_page
> >>                     free_compound_page
> >>                       mem_cgroup_uncharge
> >>                         page_counter_uncharge -> decrease the page_counter
> >>
> >> node_page_stat, which is shown in meminfo, was also decreased.
> >> __split_huge_pmd itself seems to free no physical memory unless the
> >> whole THP is freed. I am confused about which one is the true physical
> >> memory usage of a process.
> > This should be caused by the deferred split of THP. When MADV_DONTNEED
> > is called on part of the mapping, the huge PMD is split, but the
> > THP itself will not be split until memory pressure is hit (global
> > or memcg limit). So the unmapped subpages are not actually freed
> > until that point. So the mm counter is decreased due to the zapping,
> > but the physical pages are not actually freed and then uncharged from
> > memcg.
>
> Thanks!
>
> I don't know how much memory a real workload will cost, so I just
> tested the max_usage_in_bytes of the memcg with THP disabled and added a
> little bit more for the limit_in_bytes of the memcg with THP enabled, which
> triggered an OOM... (it actually cost 100M more with THP enabled). Another
> test case, for which I know the amount of memory it will cost, doesn't
> trigger an OOM with a suitable memcg limit, and I see the THP split when
> the memory hits the limit.
>
> I have another concern: k8s usually uses (rss - files) to estimate

Do you mean "workingset" used by some 3rd party k8s monitoring tools? I
recall that depends on what monitoring tools you use; for example, some
monitoring tools use active_anon + active_file.

> the memory workload, but the anon_thp on the deferred list, charged
> in the memcg, will make it look higher than actual. And it seems the

Yes, but the deferred split shrinker should handle this quite gracefully.

> container will be killed without oom...

If you have some userspace daemons which monitor the memory usage by
rss, and try to behave smarter to kill the container by looking at rss
solely, you may kill the container prematurely.

> Is it suitable to add meminfo of a deferred split list of THP?

We could, but I can't think of how it would be used to improve the
use case. Any more thoughts?
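
For reference, the gap between the two counters in the numbers quoted above can be worked out with a few lines of Python. This is only a sketch: the helper names (parse_rss_kb, charge_gap) are made up for illustration, and it assumes the cgroup v1 memory.usage_in_bytes value and the "Rss:" line of smaps_rollup exactly as they appear in the report.

```python
import re

def parse_rss_kb(smaps_rollup_text):
    """Return the Rss value (in kB) from the text of /proc/<pid>/smaps_rollup."""
    match = re.search(r"^Rss:\s+(\d+)\s+kB", smaps_rollup_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Rss line found")
    return int(match.group(1))

def charge_gap(usage_in_bytes, rss_kb):
    """Return (bytes charged to the memcg beyond rss, usage/rss ratio)."""
    rss_bytes = rss_kb * 1024
    return usage_in_bytes - rss_bytes, usage_in_bytes / rss_bytes

# Values copied from the report quoted in this thread (hypothetical excerpt
# of the smaps_rollup output, only the lines needed here).
SAMPLE = """Rss:              500648 kB
Pss:              498337 kB
AnonHugePages:    129024 kB
"""

gap, ratio = charge_gap(1080930304, parse_rss_kb(SAMPLE))
print(f"gap = {gap / 2**20:.1f} MiB, usage/rss = {ratio:.2f}")
```

With the reported values this gives a gap of roughly 542 MiB and a ratio of about 2.1. Note that memcg usage also covers page cache and other charges, so the gap is only an upper bound on what sits on the deferred split list.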
>
> >>
> >> Kind regards,
> >>
> >> Yongqiang Liu
> >>
> >>
> > .
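
The accounting behaviour discussed in this thread can be illustrated with a toy model. To be clear, this is not kernel code: ToyThp, ToyMemcg and their methods are hypothetical names, the 512 subpages per THP assume 2 MiB huge pages with 4 KiB base pages, and memory_pressure() is only a stand-in for the deferred split shrinker. It shows rss dropping at once on a partial MADV_DONTNEED while the memcg charge stays until the THP is split.

```python
from dataclasses import dataclass, field

SUBPAGES_PER_THP = 512  # assumption: 2 MiB THP made of 4 KiB base pages

@dataclass(eq=False)  # compare by identity: each object is one distinct THP
class ToyThp:
    mapped: int = SUBPAGES_PER_THP   # subpages still mapped (counted in rss)
    charged: int = SUBPAGES_PER_THP  # subpages still charged to the memcg

@dataclass
class ToyMemcg:
    thps: list = field(default_factory=list)
    deferred: list = field(default_factory=list)  # stand-in for the deferred split queue

    def fault_in_thp(self):
        thp = ToyThp()
        self.thps.append(thp)
        return thp

    def madvise_dontneed(self, thp, nr_subpages):
        """Zap nr_subpages of a THP: rss drops now, the charge does not."""
        if not 0 < nr_subpages <= thp.mapped:
            raise ValueError("nr_subpages out of range")
        thp.mapped -= nr_subpages
        if thp.mapped == 0:
            # Nothing maps the THP any more, so it is freed and uncharged.
            thp.charged = 0
            if thp in self.deferred:
                self.deferred.remove(thp)
        elif thp not in self.deferred:
            # Partially mapped: queue it, but keep the whole THP charged.
            self.deferred.append(thp)

    def memory_pressure(self):
        """Split queued THPs and uncharge their unmapped subpages."""
        for thp in self.deferred:
            thp.charged = thp.mapped
        self.deferred.clear()

    @property
    def rss_pages(self):
        return sum(t.mapped for t in self.thps)

    @property
    def charged_pages(self):
        return sum(t.charged for t in self.thps)

memcg = ToyMemcg()
for _ in range(4):
    # Fault in a THP, then drop half of it with a partial MADV_DONTNEED.
    memcg.madvise_dontneed(memcg.fault_in_thp(), SUBPAGES_PER_THP // 2)
before = (memcg.rss_pages, memcg.charged_pages)
memcg.memory_pressure()
after = (memcg.rss_pages, memcg.charged_pages)
print("before pressure (rss, charged):", before)
print("after pressure  (rss, charged):", after)
```

In this model the charge is twice the rss before pressure and matches it afterwards, which is the same shape as the usage_in_bytes versus Rss discrepancy in the original report.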