Message-ID: <2610143f-3274-47c0-9a11-777be673c186@linux.alibaba.com>
Date: Thu, 29 May 2025 16:27:59 +0800
Subject: Re: [PATCH v7 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
To: Nico Pache
Cc: David Hildenbrand, David Rientjes, zokeefe@google.com,
 linux-mm@kvack.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 ryan.roberts@arm.com, dev.jain@arm.com, corbet@lwn.net,
 rostedt@goodmis.org, mhiramat@kernel.org,
 mathieu.desnoyers@efficios.com, akpm@linux-foundation.org,
 baohua@kernel.org, willy@infradead.org, peterx@redhat.com,
 wangkefeng.wang@huawei.com, usamaarif642@gmail.com,
 sunnanyong@huawei.com, vishal.moola@gmail.com,
 thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
 kirill.shutemov@linux.intel.com, aarcange@redhat.com,
 raquini@redhat.com, anshuman.khandual@arm.com,
 catalin.marinas@arm.com, tiwai@suse.de, will@kernel.org,
 dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org,
 jglisse@google.com, surenb@google.com, hannes@cmpxchg.org,
 mhocko@suse.com, rdunlap@infradead.org
References: <20250515032226.128900-1-npache@redhat.com>
 <20250515032226.128900-7-npache@redhat.com>
 <9c54397f-3cbf-4fa2-bf69-ba89613d355f@linux.alibaba.com>
 <1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com>
From: Baolin Wang

On 2025/5/29 12:02, Nico Pache wrote:
> On Wed, May 28, 2025 at 8:04 AM Baolin Wang wrote:
>>
>> On 2025/5/28 17:26, David Hildenbrand wrote:
>>> On 22.05.25 11:39, Baolin Wang wrote:
>>>>
>>>> On 2025/5/21 18:23, Nico Pache wrote:
>>>>> On Tue, May 20, 2025 at 4:09 AM Baolin Wang wrote:
>>>>>>
>>>>>> Sorry for late reply.
>>>>>>
>>>>>> On 2025/5/17 14:47, Nico Pache wrote:
>>>>>>> On Thu, May 15, 2025 at 9:20 PM Baolin Wang wrote:
>>>>>>>>
>>>>>>>> On 2025/5/15 11:22, Nico Pache wrote:
>>>>>>>>> khugepaged scans anons PMD ranges for potential collapse to a
>>>>>>>>> hugepage. To add mTHP support we use this scan to instead record
>>>>>>>>> chunks of utilized sections of the PMD.
>>>>>>>>>
>>>>>>>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a
>>>>>>>>> bitmap that represents chunks of utilized regions. We can then
>>>>>>>>> determine what mTHP size fits best and in the following patch, we
>>>>>>>>> set this bitmap while scanning the anon PMD. A minimum collapse
>>>>>>>>> order of 2 is used as this is the lowest order supported by anon
>>>>>>>>> memory.
>>>>>>>>>
>>>>>>>>> max_ptes_none is used as a scale to determine how "full" an order
>>>>>>>>> must be before being considered for collapse.
>>>>>>>>>
>>>>>>>>> When attempting to collapse an order that has its order set to
>>>>>>>>> "always" lets always collapse to that order in a greedy manner
>>>>>>>>> without considering the number of bits set.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Nico Pache
>>>>>>>>
>>>>>>>> Sigh. You still haven't addressed or explained the issues I
>>>>>>>> previously raised [1], so I don't know how to review this patch
>>>>>>>> again...
>>>>>>> Can you still reproduce this issue?
>>>>>>
>>>>>> Yes, I can still reproduce this issue with today's (5/20) mm-new
>>>>>> branch.
>>>>>>
>>>>>> I've disabled PMD-sized THP in my system:
>>>>>> [root]# cat /sys/kernel/mm/transparent_hugepage/enabled
>>>>>> always madvise [never]
>>>>>> [root]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>> always inherit madvise [never]
>>>>>>
>>>>>> And I tried calling madvise() with MADV_COLLAPSE for anonymous memory,
>>>>>> and I can still see it collapsing to a PMD-sized THP.
>>>>> Hi Baolin ! Thank you for your reply and willingness to test again :)
>>>>>
>>>>> I didn't realize we were talking about madvise collapse-- this makes
>>>>> sense now. I also figured out why I could "reproduce" it before. My
>>>>> script was always enabling the THP settings in two places, and I only
>>>>> commented out one to test this. But this time I was doing more manual
>>>>> testing.
>>>>>
>>>>> The original design of madvise_collapse ignores the sysfs and
>>>>> collapses even if you have an order disabled. I believe this behavior
>>>>> is wrong, but by design. I spent some time playing around with madvise
>>>>> collapses with and w/o my changes. This is not a new thing, I
>>>>> reproduced the issue in 6.11 (Fedora 41), and I think its been
>>>>> possible since the inception of madvise collapse 3 years ago. I
>>>>> noticed a similar behavior on one of my RFC since it was "breaking"
>>>>> selftests, and the fix was to reincorporate this broken sysfs
>>>>> behavior.
>>>>
>>>> OK. Thanks for the explanation.
>>>>
>>>>> 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage
>>>>> collapse")
>>>>> "This call is independent of the system-wide THP sysfs settings, but
>>>>> will fail for memory marked VM_NOHUGEPAGE."
>>>>>
>>>>> The second condition holds true (and fails for VM_NOHUGEPAGE), but I
>>>>> dont know if we actually want madvise_collapse to be independent of
>>>>> the system-wide.
>>>>
>>>> This design principle surprised me a bit, and I failed to find the
>>>> reason in the commit log. I agree that "never should mean never," and we
>>>> should respect the THP/mTHP sysfs setting. Additionally, for the
>>>> 'shmem_enabled' sysfs interface controlled for shmem/tmpfs, THP collapse
>>>> can still be prohibited through the 'deny' configuration. The rules here
>>>> are somewhat confusing.
>>>
>>> I recall that we decided to overwrite "VM_NOHUGEPAGE", because the
>>> assumption is that the same app that triggered MADV_NOHUGEPAGE triggers
>>> the collapse. So the app decides on its own behavior.
>>>
>>> Similarly, allowing for collapsing in a VM without VM_HUGEPAGE in the
>>> "madvise" mode would be fine.
>>>
>>> But in the "never" case, we should just "never" collapse.
>>
>> OK. Let's fix the "never" case first. Thanks.
> Great, I will update that in the next version!

I've sent a patchset to fix the MADV_COLLAPSE issue for anonymous memory
and shmem [1]. Please have a look.

[1] https://lore.kernel.org/all/cover.1748506520.git.baolin.wang@linux.alibaba.com/
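
For anyone checking the behaviour discussed above from user space, the rule the thread converges on ("never" should mean never, while "madvise" mode may still collapse) can be sketched as a small helper. This is only an illustration under stated assumptions, not kernel code and not part of the patchset: the function names `selected_mode`, `effective_pmd_mode` and `collapse_expected` are hypothetical, and the sketch assumes the per-size "inherit" value defers to the top-level `enabled` setting.

```python
import re


def selected_mode(sysfs_text: str) -> str:
    """Return the active (bracketed) token of a THP sysfs 'enabled' file.

    Example input: "always madvise [never]".
    """
    match = re.search(r"\[(\w+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active mode found in {sysfs_text!r}")
    return match.group(1)


def effective_pmd_mode(global_text: str, pmd_text: str) -> str:
    """Resolve the PMD-size mode (hypothetical helper).

    Assumption: "inherit" in the per-size file means "use the top-level
    transparent_hugepage/enabled setting".
    """
    pmd_mode = selected_mode(pmd_text)
    if pmd_mode == "inherit":
        return selected_mode(global_text)
    return pmd_mode


def collapse_expected(global_text: str, pmd_text: str) -> bool:
    """Expected MADV_COLLAPSE outcome under the rule proposed in this thread.

    Only an effective mode of "never" forbids a PMD-sized collapse;
    "always" and "madvise" both allow it.
    """
    return effective_pmd_mode(global_text, pmd_text) != "never"
```

On a test machine the two strings would come from `/sys/kernel/mm/transparent_hugepage/enabled` and `/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled`. With the settings from the reproduction above, `collapse_expected` returns `False`, so seeing a PMD-sized THP after MADV_COLLAPSE would indicate the reported issue.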