From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled
From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Yang Shi
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
 Matthew Wilcox, "Wangkefeng (OS Kernel Lab)", "zhangxiaoxu (A)", Lu Jialin
Date: Thu, 1 Dec 2022 10:22:31 +0800
Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
References: <8a2f2644-71d0-05d7-49d8-878aafa99652@huawei.com>
 <6b7142cf-386e-e1d2-a122-b923337a593e@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to count how much physical memory a process uses.
>>>> Meanwhile, the page_counter of a memcg is used to count how much
>>>> physical memory a cgroup uses. If a cgroup contains only one process,
>>>> the two look almost the same. But with THP enabled, sometimes
>>>> memory.usage_in_bytes in the memcg can be twice the Rss shown in
>>>> /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0      [rollup]
>>>> Rss:              500648 kB
>>>> Pss:              498337 kB
>>>> Shared_Clean:       2732 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:       364 kB
>>>> Private_Dirty:    497552 kB
>>>> Referenced:       500648 kB
>>>> Anonymous:        492016 kB
>>>> LazyFree:              0 kB
>>>> AnonHugePages:    129024 kB
>>>> ShmemPmdMapped:        0 kB
>>>> Shared_Hugetlb:        0 kB
>>>> Private_Hugetlb:       0 kB
>>>> Swap:                  0 kB
>>>> SwapPss:               0 kB
>>>> Locked:                0 kB
>>>> THPeligible:           0
>>>>
>>>> I found that the difference is because, in the MADV_DONTNEED path, the
>>>> mm_counter is decreased after __split_huge_pmd, but the page_counter in
>>>> the memcg is not decreased while the refcount of the head page is still
>>>> non-zero. Here is the call chain:
>>>>
>>>> do_madvise
>>>>   madvise_dontneed_free
>>>>     zap_page_range
>>>>       unmap_single_vma
>>>>         zap_pud_range
>>>>           zap_pmd_range
>>>>             __split_huge_pmd
>>>>               __split_huge_pmd_locked
>>>>                 __mod_lruvec_page_state
>>>>             zap_pte_range
>>>>               add_mm_rss_vec
>>>>                 add_mm_counter             -> decrease the mm_counter
>>>>       tlb_finish_mmu
>>>>         arch_tlb_finish_mmu
>>>>           tlb_flush_mmu_free
>>>>             free_pages_and_swap_cache
>>>>               release_pages
>>>>                 folio_put_testzero(page)   -> not zero, skip
>>>>                   continue;
>>>>                 __folio_put_large
>>>>                   free_transhuge_page
>>>>                     free_compound_page
>>>>                       mem_cgroup_uncharge
>>>>                         page_counter_uncharge -> decrease the page_counter
>>>>
>>>> The node_page_state shown in meminfo is also decreased. __split_huge_pmd
>>>> seems to free no physical memory unless the whole THP is freed. I am
>>>> confused about which one reflects the true physical memory usage of a
>>>> process.
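For reference, a rough userspace sketch along these lines reproduces the gap
when run inside the memcg above (this is not the exact test we ran; the 2M
huge page size, the page count and the "madvise" THP policy are assumptions).
After the MADV_DONTNEED loop, Rss in smaps_rollup should drop while
memory.usage_in_bytes stays high until the THPs are actually split under
memory pressure:

/*
 * Rough reproducer sketch: map some anonymous THPs, touch them, then
 * MADV_DONTNEED half of each huge page while the task sits in the memcg.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)	/* assume 2M PMD-sized THP */
#define NR_HPAGES	64UL

int main(void)
{
	size_t len = NR_HPAGES * HPAGE_SIZE;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Ask for THP explicitly in case the system policy is "madvise". */
	madvise(buf, len, MADV_HUGEPAGE);

	/* Fault in every huge page. */
	memset(buf, 1, len);

	/*
	 * Zap the second half of each huge page: the PMDs are split and the
	 * mm counters drop, but the compound pages only go onto the deferred
	 * split queue, so the memcg stays charged for the full THPs.
	 */
	for (unsigned long i = 0; i < NR_HPAGES; i++)
		madvise(buf + i * HPAGE_SIZE + HPAGE_SIZE / 2,
			HPAGE_SIZE / 2, MADV_DONTNEED);

	printf("pid %d: now compare Rss in /proc/%d/smaps_rollup with "
	       "memory.usage_in_bytes of the memcg\n", getpid(), getpid());
	pause();
	return 0;
}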
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or the memcg limit). So the unmapped subpages are not actually freed
>>> until that point. The mm counter is decreased due to the zapping,
>>> but the physical pages are not yet freed and uncharged from the
>>> memcg.
>> Thanks!
>>
>> I don't know how much memory a real workload will cost. So I just
>> measured max_usage_in_bytes of the memcg with THP disabled and added a
>> little bit more for limit_in_bytes of the memcg with THP enabled, which
>> triggered an OOM (it actually cost about 100M more with THP enabled).
>> Another test case, where I know how much memory will be used, does not
>> trigger an OOM with a suitable memcg limit, and I can see the THP being
>> split when the memory usage hits the limit.
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean the "workingset" used by some 3rd party k8s monitoring tools?
> I recall that it depends on which monitoring tools you use; for example,
> some monitoring uses active_anon + active_file.

Yes, I noticed that k8s uses a parent pod which sets a memcg limit to cover
all child pods, and the workingset monitor watches the root memcg.

>
>> the memory workload, but the anon THP sitting on the deferred split list
>> and still charged to the memcg will make it look higher than actual. And
>> it seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without OOM...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter and kill the container by looking at rss
> alone, you may kill the container prematurely.

Thanks.

>
>> Is it suitable to add the deferred split list of THP to meminfo?
> We could, but I can't think of how it would be used to improve the
> usecase. Any more thoughts?

In the current k8s scenario, I think it will not kill the container if the
parent pod memcg limit is set correctly. Maybe meminfo together with a split
interface would be helpful for users to release memory in advance.

>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
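For what it's worth, until such an interface exists, a monitoring daemon can
already ask the kernel to reclaim the memcg before trusting an rss-based
estimate, which also splits and frees the deferred-split THPs. A rough sketch
follows; the cgroup v2 path is an assumption, memory.force_empty exists only
on cgroup v1, and memory.reclaim only on newer kernels with cgroup v2:

/*
 * Sketch: trigger reclaim on the memcg, then re-read its usage, which should
 * then be much closer to the Rss in smaps_rollup because the partially
 * unmapped THPs have been split and uncharged.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(val, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	/* cgroup v1: reclaim as much as possible from the memcg. */
	if (write_str("/sys/fs/cgroup/memory/test/memory.force_empty", "1") == 0)
		printf("reclaimed via v1 memory.force_empty\n");
	/* cgroup v2 (newer kernels): reclaim a bounded amount instead. */
	else if (write_str("/sys/fs/cgroup/test/memory.reclaim", "512M") == 0)
		printf("reclaimed via v2 memory.reclaim\n");
	else
		perror("no usable reclaim interface");

	return 0;
}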