Date: Fri, 11 Apr 2025 14:21:56 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Vlastimil Babka
Cc: Andrew Morton, Mel Gorman, Zi Yan, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Message-ID: <20250411182156.GE366747@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>
 <20250313210647.1314586-6-hannes@cmpxchg.org>
 <46f1b2ab-2903-4cde-9e68-e334a0d0df22@suse.cz>
 <20250411153906.GC366747@cmpxchg.org>

On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> On 4/11/25 17:39, Johannes Weiner wrote:
> > On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> >> On 3/13/25 22:05, Johannes Weiner wrote:
> >> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >> >  	if (!pageblock_aligned(cc->migrate_pfn))
> >> >  		return COMPACT_CONTINUE;
> >> >  
> >> > +	/*
> >> > +	 * When defrag_mode is enabled, make kcompactd target
> >> > +	 * watermarks in whole pageblocks. Because they can be stolen
> >> > +	 * without polluting, no further fallback checks are needed.
> >> > +	 */
> >> > +	if (defrag_mode && !cc->direct_compaction) {
> >> > +		if (__zone_watermark_ok(cc->zone, cc->order,
> >> > +					high_wmark_pages(cc->zone),
> >> > +					cc->highest_zoneidx, cc->alloc_flags,
> >> > +					zone_page_state(cc->zone,
> >> > +							NR_FREE_PAGES_BLOCKS)))
> >> > +			return COMPACT_SUCCESS;
> >> > +
> >> > +		return COMPACT_CONTINUE;
> >> > +	}
> >>
> >> Wonder if this ever succeeds in practice. Is high_wmark_pages() even
> >> aligned to pageblock size? If not, and it's X pageblocks and a half,
> >> we will rarely have NR_FREE_PAGES_BLOCKS cover all of that? Also,
> >> concurrent allocations can put us below the high wmark quickly, and
> >> then we never satisfy this?
> >
> > The high watermark is not aligned, but why does it have to be? It's a
> > binary condition: met or not met. Compaction continues until it's met.
>
> What I mean is, kswapd will reclaim until the high watermark, which
> would be 32.7 blocks, wake up kcompactd [*], but that can only create
> up to 32 blocks of NR_FREE_PAGES_BLOCKS, so it has already lost at that
> point? (unless there's concurrent freeing pushing it above the high wmark)

Ah, but kswapd also uses the (rounded up) NR_FREE_PAGES_BLOCKS check.

Buckle up...
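[ To make the rounding concrete: testing the high watermark against
  free pages that sit in whole free pageblocks implicitly rounds the
  goal up to the next whole block. A minimal user-space sketch, with
  illustrative names and numbers rather than the kernel's:

	#include <stdbool.h>
	#include <stdio.h>

	#define PAGEBLOCK_NR_PAGES 512	/* 2MB blocks with 4k pages */

	/* free pages counted only in fully free pageblocks; a toy
	 * stand-in for the NR_FREE_PAGES_BLOCKS vmstat counter */
	static bool block_wmark_ok(unsigned long high_wmark,
				   unsigned long free_block_pages)
	{
		return free_block_pages >= high_wmark;
	}

	int main(void)
	{
		/* the "32.7 blocks" high watermark from above */
		unsigned long high_wmark = 32 * PAGEBLOCK_NR_PAGES + 358;

		/* 32 whole blocks free: prints 0, not balanced */
		printf("%d\n", block_wmark_ok(high_wmark,
					      32 * PAGEBLOCK_NR_PAGES));

		/* 33 whole blocks free: prints 1, goal met */
		printf("%d\n", block_wmark_ok(high_wmark,
					      33 * PAGEBLOCK_NR_PAGES));
		return 0;
	}

  Since the counter only moves in whole-block steps, both daemons chase
  what is effectively a 33-block goal, so neither is stranded on an
  unreachable fraction. ]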
> > Of course, similar to kswapd, it might not reach the watermarks and
> > keep running if there is a continuous stream of allocations consuming
> > the blocks it's making. Hence the ratio between wakeups & continues.
> >
> > But when demand stops, it'll balance the high mark and quit.
>
> Again, since kcompactd can only defragment free space and not create
> it, it may be trying in vain?
>
> [*] now when checking the code between kswapd and kcompactd handover,
> I think I found another problem?
>
> we have:
> kswapd_try_to_sleep()
>   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>     pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to
>     be true
>     - with defrag_mode we want the high watermark of
>       NR_FREE_PAGES_BLOCKS, but we were only reclaiming until now and
>       didn't wake up kcompactd, and this actually prevents the wakeup?

Correct, so as per above, kswapd also does the NR_FREE_PAGES_BLOCKS
check. At first, at least. So it continues to produce adequate scratch
space and won't leave compaction high and dry on a watermark it cannot
meet. They are indeed coordinated in this aspect.

As far as the *handoff* to kcompactd goes, I've been pulling my hair
out over this for a very long time. You're correct about the graph
above. And actually, this is the case before defrag_mode too: if you
wake kswapd with, say, an order-8 request, it will do pgdat_balanced()
checks against that order, seemingly reclaim until the request can
succeed, *then* wake kcompactd and sleep. WTF?

But kswapd has this:

	/*
	 * Fragmentation may mean that the system cannot be rebalanced for
	 * high-order allocations. If twice the allocation size has been
	 * reclaimed then recheck watermarks only at order-0 to prevent
	 * excessive reclaim. Assume that a process requested a high-order
	 * can direct reclaim/compact.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

Ignore the comment and just consider the code. What it does for higher
orders (whether defrag_mode is enabled or not) is reclaim a gap for the
order, ensure order-0 is met (most likely it already is), then enter
the sleep path: wake kcompactd and wait for more work.

Effectively, as long as there are pending higher-order requests looping
in the allocator, kswapd does this:

1) reclaim a compaction gap delta
2) wake kcompactd
3) goto 1

This pipelining seems to work *very* well in practice, especially when
there is a large number of concurrent requests.

In the original huge allocator series, I tried to convert kswapd to use
compaction_suitable() to hand over sooner. However, this ran into
scaling issues with higher allocation concurrency: maintaining just a
single, static compaction gap when there could be hundreds of
allocation requests waiting for compaction results falls apart fast.

The current code has it right. The comments might be a bit dated, and
maybe it could use some fine-tuning. But generally, as long as there
are incoming wakeups from the allocator, it makes sense to keep making
more space for compaction as well. I think Mel was playing 4d chess
with this stuff.

[ I kept direct reclaim/compaction out of this defrag_mode series, but
  testing suggests the same is likely true for the direct path. Direct
  reclaim bails from compaction_ready() if there is a static compaction
  gap for that order. But once the gap for a given order is there, you
  can get a thundering herd of direct compactors storming that gap,
  most of which will then fail compaction_suitable().
A pipeline of "reclaim gap delta, direct compact, retry" seems to make more sense there as well. With adequate checks to prevent excessive reclaim in corner cases of course... ]