From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Yang Shi
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
 Matthew Wilcox, "Wangkefeng (OS Kernel Lab)", "zhangxiaoxu (A)", Lu Jialin
Date: Tue, 29 Nov 2022 21:14:29 +0800
Message-ID: <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>
References: <8a2f2644-71d0-05d7-49d8-878aafa99652@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2022/11/29 4:01, Yang Shi wrote:
> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu wrote:
>> Hi,
>>
>> We use mm_counter to track how much physical memory a process uses.
>> Meanwhile, the page_counter of a memcg is used to count how much physical
>> memory a cgroup uses. If a cgroup contains only one process, the two look
>> almost the same. But with THP enabled, memory.usage_in_bytes of the memcg
>> can sometimes be twice (or more) the Rss reported in
>> /proc/[pid]/smaps_rollup, as follows:
>>
>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>> 1080930304
>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>> 1290
>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0          [rollup]
>> Rss:              500648 kB
>> Pss:              498337 kB
>> Shared_Clean:       2732 kB
>> Shared_Dirty:          0 kB
>> Private_Clean:       364 kB
>> Private_Dirty:    497552 kB
>> Referenced:       500648 kB
>> Anonymous:        492016 kB
>> LazyFree:              0 kB
>> AnonHugePages:    129024 kB
>> ShmemPmdMapped:        0 kB
>> Shared_Hugetlb:        0 kB
>> Private_Hugetlb:       0 kB
>> Swap:                  0 kB
>> SwapPss:               0 kB
>> Locked:                0 kB
>> THPeligible:           0
>>
>> I found that the difference is because __split_huge_pmd decreases the
>> mm_counter, but the page_counter in the memcg is not decreased as long as
>> the refcount of the head page is not zero. Here is the call path:
>>
>> do_madvise
>>   madvise_dontneed_free
>>     zap_page_range
>>       unmap_single_vma
>>         zap_pud_range
>>           zap_pmd_range
>>             __split_huge_pmd
>>               __split_huge_pmd_locked
>>                 __mod_lruvec_page_state
>>             zap_pte_range
>>               add_mm_rss_vec
>>                 add_mm_counter            -> decreases the mm_counter
>>       tlb_finish_mmu
>>         arch_tlb_finish_mmu
>>           tlb_flush_mmu_free
>>             free_pages_and_swap_cache
>>               release_pages
>>                 folio_put_testzero(page)  -> not zero, skip
>>                   continue;
>>                 __folio_put_large
>>                   free_transhuge_page
>>                     free_compound_page
>>                       mem_cgroup_uncharge
>>                         page_counter_uncharge -> decreases the page_counter
>>
>> node_page_stat, which is shown in meminfo, is also decreased.
>> __split_huge_pmd seems to free no physical memory unless the whole THP is
>> freed. I am confused about which one is the true physical memory usage of
>> a process.
> This should be caused by the deferred split of THP. When MADV_DONTNEED
> is called on part of the mapping, the huge PMD is split, but the THP
> itself will not be split until memory pressure is hit (global or the
> memcg limit). So the unmapped subpages are not actually freed until that
> point. The mm counter is decreased due to the zapping, but the physical
> pages are not actually freed, and therefore not uncharged from the memcg.

Thanks!
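For reference, here is a rough userspace sketch of the pattern we hit (not
our real workload; the 512M size, the half-page stride and the cgroup setup
are arbitrary, and it assumes PMD-sized THPs are available and enabled):

/*
 * Rough reproducer sketch: fault in a THP-backed region inside a memcg,
 * then MADV_DONTNEED only half of every huge page.  The PMDs get split
 * and the mm counters drop, but the THPs only go onto the deferred
 * split list and stay charged to the memcg.  Sizes are arbitrary.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL << 20)     /* assume 2M PMD-sized THP */
#define NR_HPAGES       256             /* 512M mapping */

int main(void)
{
        size_t len = NR_HPAGES * HPAGE_SIZE;
        void *raw;
        char *buf;
        size_t i;

        /* Over-allocate so the buffer can be aligned to a huge page. */
        raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        buf = (char *)(((unsigned long)raw + HPAGE_SIZE - 1) &
                       ~(HPAGE_SIZE - 1));
        madvise(buf, len, MADV_HUGEPAGE);

        /* Fault everything in so THPs are allocated and charged. */
        memset(buf, 1, len);

        /* Zap only the second half of every huge page. */
        for (i = 0; i < NR_HPAGES; i++)
                madvise(buf + i * HPAGE_SIZE + HPAGE_SIZE / 2,
                        HPAGE_SIZE / 2, MADV_DONTNEED);

        printf("pid %d: compare memory.usage_in_bytes with smaps_rollup Rss now\n",
               (int)getpid());
        pause();
        return 0;
}

Running this inside the "test" memcg above, I would expect Rss in
smaps_rollup to drop by roughly half after the loop, while
memory.usage_in_bytes stays close to the full size until the THPs are
actually split and reclaimed.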
I don't know how much memory a real workload will cost, so I just measured
max_usage_in_bytes of the memcg with THP disabled and set limit_in_bytes of
the THP-enabled memcg a little higher than that, which triggered an OOM...
(it actually cost about 100M more with THP enabled). Another test case,
whose memory consumption I do know, does not trigger an OOM with a suitable
memcg limit, and I can see the THPs being split when the memory hits the
limit.

I have another concern: k8s usually uses (rss - files) to estimate the
memory of a workload, but the anon THPs sitting on the deferred split list
are still charged to the memcg, which makes the usage look higher than it
actually is. And it seems the container can be killed without an OOM...
Would it be suitable to expose the deferred split THPs in meminfo? (A rough
sketch of the comparison I have in mind is at the end of this mail.)

>> Kind regards,
>>
>> Yongqiang Liu
>>
> .
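P.S. The comparison I mean, roughly; this is only a sketch against the
cgroup v1 layout and the "test" cgroup from the example above, using the
field names from memory.stat:

/*
 * Sketch only: compare what the memcg charges with the rss/cache
 * figures an "(rss - files)"-style estimate is based on.  Uses the
 * cgroup v1 layout and the "test" cgroup from the example above.
 */
#include <stdio.h>
#include <string.h>

#define MEMCG "/sys/fs/cgroup/memory/test"

static long long read_stat(const char *key)
{
        char name[64];
        long long v;
        FILE *f = fopen(MEMCG "/memory.stat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %lld", name, &v) == 2) {
                if (!strcmp(name, key)) {
                        fclose(f);
                        return v;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        long long usage = -1, rss, cache;
        FILE *f = fopen(MEMCG "/memory.usage_in_bytes", "r");

        if (f) {
                if (fscanf(f, "%lld", &usage) != 1)
                        usage = -1;
                fclose(f);
        }
        rss = read_stat("total_rss");
        cache = read_stat("total_cache");

        printf("usage_in_bytes      : %lld\n", usage);
        printf("total_rss           : %lld\n", rss);
        printf("total_cache         : %lld\n", cache);
        /*
         * Charged memory that neither rss nor cache reports, e.g.
         * unmapped subpages of THPs still waiting on the deferred
         * split queue (plus kernel memory).
         */
        printf("usage - rss - cache : %lld\n", usage - rss - cache);
        return 0;
}

While the huge pages sit on the deferred split queue, I would expect
usage_in_bytes to stay noticeably above total_rss + total_cache, and that
gap is exactly the part an (rss - files) style estimate never sees.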