From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Yang Shi
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
 Matthew Wilcox, "Wangkefeng (OS Kernel Lab)", "zhangxiaoxu (A)", Lu Jialin
Date: Thu, 1 Dec 2022 10:22:31 +0800
Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
References: <8a2f2644-71d0-05d7-49d8-878aafa99652@huawei.com>
 <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to count how much physical memory a process uses.
>>>> Meanwhile, the page_counter of a memcg is used to count how much
>>>> physical memory a cgroup uses. If a cgroup contains only one process,
>>>> the two look almost the same. But with THP enabled, sometimes
>>>> memory.usage_in_bytes in the memcg can be twice the Rss shown in
>>>> /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0      [rollup]
>>>> Rss:              500648 kB
>>>> Pss:              498337 kB
>>>> Shared_Clean:       2732 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:       364 kB
>>>> Private_Dirty:    497552 kB
>>>> Referenced:       500648 kB
>>>> Anonymous:        492016 kB
>>>> LazyFree:              0 kB
>>>> AnonHugePages:    129024 kB
>>>> ShmemPmdMapped:        0 kB
>>>> Shared_Hugetlb:        0 kB
>>>> Private_Hugetlb:       0 kB
>>>> Swap:                  0 kB
>>>> SwapPss:               0 kB
>>>> Locked:                0 kB
>>>> THPeligible:           0
>>>>
>>>> I found that the difference is because, in the MADV_DONTNEED path, the
>>>> mm_counter is decreased after __split_huge_pmd, but the page_counter in
>>>> the memcg is not decreased while the refcount of the head page is still
>>>> non-zero. Here is the call chain:
>>>>
>>>> do_madvise
>>>>   madvise_dontneed_free
>>>>     zap_page_range
>>>>       unmap_single_vma
>>>>         zap_pud_range
>>>>           zap_pmd_range
>>>>             __split_huge_pmd
>>>>               __split_huge_pmd_locked
>>>>                 __mod_lruvec_page_state
>>>>             zap_pte_range
>>>>               add_mm_rss_vec
>>>>                 add_mm_counter             -> decrease the mm_counter
>>>>       tlb_finish_mmu
>>>>         arch_tlb_finish_mmu
>>>>           tlb_flush_mmu_free
>>>>             free_pages_and_swap_cache
>>>>               release_pages
>>>>                 folio_put_testzero(page)   -> not zero, skip
>>>>                   continue;
>>>>                 __folio_put_large
>>>>                   free_transhuge_page
>>>>                     free_compound_page
>>>>                       mem_cgroup_uncharge
>>>>                         page_counter_uncharge -> decrease the page_counter
>>>>
>>>> The node_page_state shown in meminfo is also decreased. __split_huge_pmd
>>>> seems to free no physical memory unless the whole THP is freed. I am
>>>> confused about which one reflects the true physical memory usage of a
>>>> process.
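For reference, a rough userspace sketch along these lines reproduces the gap
when run inside the memcg above (this is not the exact test we ran; the 2M
huge page size, the page count and the "madvise" THP policy are assumptions).
After the MADV_DONTNEED loop, Rss in smaps_rollup should drop while
memory.usage_in_bytes stays high until the THPs are actually split under
memory pressure:

/*
 * Rough reproducer sketch: map some anonymous THPs, touch them, then
 * MADV_DONTNEED half of each huge page while the task sits in the memcg.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)	/* assume 2M PMD-sized THP */
#define NR_HPAGES	64UL

int main(void)
{
	size_t len = NR_HPAGES * HPAGE_SIZE;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Ask for THP explicitly in case the system policy is "madvise". */
	madvise(buf, len, MADV_HUGEPAGE);

	/* Fault in every huge page. */
	memset(buf, 1, len);

	/*
	 * Zap the second half of each huge page: the PMDs are split and the
	 * mm counters drop, but the compound pages only go onto the deferred
	 * split queue, so the memcg stays charged for the full THPs.
	 */
	for (unsigned long i = 0; i < NR_HPAGES; i++)
		madvise(buf + i * HPAGE_SIZE + HPAGE_SIZE / 2,
			HPAGE_SIZE / 2, MADV_DONTNEED);

	printf("pid %d: now compare Rss in /proc/%d/smaps_rollup with "
	       "memory.usage_in_bytes of the memcg\n", getpid(), getpid());
	pause();
	return 0;
}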
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or the memcg limit). So the unmapped subpages are not actually freed
>>> until that point. The mm counter is decreased due to the zapping,
>>> but the physical pages are not yet freed and uncharged from the
>>> memcg.
>> Thanks!
>>
>> I don't know how much memory a real workload will cost. So I just
>> measured max_usage_in_bytes of the memcg with THP disabled and added a
>> little bit more for limit_in_bytes of the memcg with THP enabled, which
>> triggered an OOM (it actually cost about 100M more with THP enabled).
>> Another test case, where I know how much memory will be used, does not
>> trigger an OOM with a suitable memcg limit, and I can see the THP being
>> split when the memory usage hits the limit.
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean the "workingset" used by some 3rd party k8s monitoring tools?
> I recall that it depends on which monitoring tools you use; for example,
> some monitoring uses active_anon + active_file.

Yes, I noticed that k8s uses a parent pod which sets a memcg limit to cover
all child pods, and the workingset monitor watches the root memcg.

>
>> the memory workload, but the anon THP sitting on the deferred split list
>> and still charged to the memcg will make it look higher than actual. And
>> it seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without OOM...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter and kill the container by looking at rss
> alone, you may kill the container prematurely.

Thanks.

>
>> Is it suitable to add the deferred split list of THP to meminfo?
> We could, but I can't think of how it would be used to improve the
> usecase. Any more thoughts?

In the current k8s scenario, I think it will not kill the container if the
parent pod memcg limit is set correctly. Maybe meminfo together with a split
interface would be helpful for users to release memory in advance.

>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
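For what it's worth, until such an interface exists, a monitoring daemon can
already ask the kernel to reclaim the memcg before trusting an rss-based
estimate, which also splits and frees the deferred-split THPs. A rough sketch
follows; the cgroup v2 path is an assumption, memory.force_empty exists only
on cgroup v1, and memory.reclaim only on newer kernels with cgroup v2:

/*
 * Sketch: trigger reclaim on the memcg, then re-read its usage, which should
 * then be much closer to the Rss in smaps_rollup because the partially
 * unmapped THPs have been split and uncharged.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(val, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	/* cgroup v1: reclaim as much as possible from the memcg. */
	if (write_str("/sys/fs/cgroup/memory/test/memory.force_empty", "1") == 0)
		printf("reclaimed via v1 memory.force_empty\n");
	/* cgroup v2 (newer kernels): reclaim a bounded amount instead. */
	else if (write_str("/sys/fs/cgroup/test/memory.reclaim", "512M") == 0)
		printf("reclaimed via v2 memory.reclaim\n");
	else
		perror("no usable reclaim interface");

	return 0;
}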