From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 Nov 2023 12:28:20 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: "zhaoyang.huang" <zhaoyang.huang@unisoc.com>
Cc: Andrew Morton, Roman Gushchin, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Zhaoyang Huang, steve.kang@unisoc.com,
	Vlastimil Babka, Mel Gorman, Joonsoo Kim
Subject: Re: [PATCHv6 1/1] mm: optimization on page allocation when CMA enabled
Message-ID: <20231107172820.GA3745089@cmpxchg.org>
References: <20231016071245.2865233-1-zhaoyang.huang@unisoc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20231016071245.2865233-1-zhaoyang.huang@unisoc.com>

On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> According to current CMA utilization policy, an alloc_pages(GFP_USER)
> could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> which could lead to following alloc_pages(GFP_KERNEL) fail.
> Solving this by introducing second watermark checking for GFP_MOVABLE,
> which could have the allocation use CMA when proper.
>
> -- Free_pages(30MB)
> |
> |
> -- WMARK_LOW(25MB)
> |
> -- Free_CMA(12MB)
> |
> |
> --

We're running into the same issue in production and had an incident
over the weekend because of it. The hosts have a raised
vm.min_free_kbytes for network rx reliability, which makes the
mismatch between free pages and what's actually allocatable by
regular kernel requests quite pronounced. It wasn't OOMing this time,
but we saw very high rates of thrashing while CMA had plenty of
headroom.

I had raised the broader issue around poor CMA utilization before:

https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/

For context, we're using hugetlb_cma at several gigabytes to allow
sharing hosts between jobs that use hugetlb and jobs that don't.

> @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>
> }
>
> +#ifdef CONFIG_CMA
> +/*
> + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
> + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
> + * again without ALLOC_CMA to see if to use CMA first.
> + */
> +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> +{
> +	unsigned long watermark;
> +	bool cma_first = false;
> +
> +	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> +	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> +		/*
> +		 * Balance movable allocations between regular and CMA areas by
> +		 * allocating from CMA when over half of the zone's free memory
> +		 * is in the CMA area.
> +		 */
> +		cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> +			     zone_page_state(zone, NR_FREE_PAGES) / 2);
> +	} else {
> +		/*
> +		 * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
> +		 * now, we should use cma first to keep them stay around the
> +		 * corresponding watermark
> +		 */
> +		cma_first = true;
> +	}
> +	return cma_first;

I think it's a step in the right direction. However, it doesn't take
the lowmem reserves into account. With DMA32 that can be an
additional multiple gigabytes of "free" memory not available to
GFP_KERNEL.
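To put rough numbers on that, here is a toy model of the balancing
decision (a userspace sketch, not kernel code; the zone figures are
hypothetical, in MB for readability):

#include <stdio.h>

int main(void)
{
	long free = 3000;	/* NR_FREE_PAGES */
	long free_cma = 1400;	/* NR_FREE_CMA_PAGES */
	long reserve = 800;	/* lowmem reserve, e.g. with DMA32 below */
	long wmark = 2000;	/* elevated vm.min_free_kbytes */
	long avail = free - reserve - wmark;

	/* balancing on raw free pages, as in use_cma_first() above */
	printf("raw:      %s\n", free_cma > free / 2 ? "use CMA" : "skip CMA");

	/* balancing on what regular requests can actually allocate */
	printf("adjusted: %s\n", free_cma > avail / 2 ? "use CMA" : "skip CMA");

	return 0;
}

With these numbers the raw check skips CMA (1400 <= 3000/2) even
though only 200M is actually spare for regular requests; subtracting
reserves and watermarks flips the decision.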
It also has a knee in the balancing curve because it doesn't take
reserves into account *until* non-CMA is depleted - at which point it
would already be below the use-CMA threshold by the full reserves and
watermarks.

A more complete solution would have to plumb the highest_zoneidx
information through the rmqueue family of functions somehow, and
always take unavailable free memory into account:

---
Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning to fail

We can get into a situation where kernel allocations are starting to
fail on watermarks, but movable allocations still don't use CMA
because they make up more than half of the free memory.

This can happen in particular with elevated vm.min_free_kbytes
settings, where the remaining free pages aren't available to
non-atomic requests.

Example scenario:

	Free:		3.0G
	Watermarks:	2.0G
	CMA:		1.4G

	-> non-CMA:	1.6G

CMA isn't used because CMA <= free/2. Kernel allocations fail due to
non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
without swap), the kernel is more likely to OOM prematurely.

Reduce the probability of that happening by taking reserves and
watermarks into account when deciding whether to start using CMA.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..b9273d7f23b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 }
 
+static bool should_try_cma(struct zone *zone, unsigned int order,
+			   gfp_t gfp_flags, unsigned int alloc_flags)
+{
+	long free_pages;
+
+	if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
+		return false;
+
+	/*
+	 * CMA regions can be used by movable allocations while
+	 * they're not otherwise in use. This is a delicate balance:
+	 * Filling CMA too soon poses a latency risk for actual CMA
+	 * allocations (think camera app startup). Filling CMA too
+	 * late risks premature OOMs from non-movable allocations.
+	 *
+	 * Start using CMA once it dominates the remaining free
+	 * memory. Be sure to take watermarks and reserves into
+	 * account when considering what's truly "free".
+	 *
+	 * free_pages can go negative, but that's okay because
+	 * NR_FREE_CMA_PAGES should not.
+	 */
+
+	free_pages = zone_page_state(zone, NR_FREE_PAGES);
+	free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
+	free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+
+	return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype,
-						unsigned int alloc_flags)
+__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
+	  int migratetype, unsigned int alloc_flags)
 {
 	struct page *page;
 
-	if (IS_ENABLED(CONFIG_CMA)) {
-		/*
-		 * Balance movable allocations between regular and CMA areas by
-		 * allocating from CMA when over half of the zone's free memory
-		 * is in the CMA area.
-		 */
-		if (alloc_flags & ALLOC_CMA &&
-		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
-		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
-			page = __rmqueue_cma_fallback(zone, order);
-			if (page)
-				return page;
-		}
+	if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
+		page = __rmqueue_cma_fallback(zone, order);
+		if (page)
+			return page;
 	}
+
 retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
 	if (unlikely(!page)) {
@@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
+static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
@@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype,
-								alloc_flags);
+		struct page *page = __rmqueue(zone, order, gfp_flags,
+					      migratetype, alloc_flags);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 
 static __always_inline
 struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
-			   unsigned int order, unsigned int alloc_flags,
-			   int migratetype)
+			   unsigned int order, gfp_t gfp_flags,
+			   unsigned int alloc_flags, int migratetype)
 {
 	struct page *page;
 	unsigned long flags;
@@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			page = __rmqueue(zone, order, gfp_flags,
+					 migratetype, alloc_flags);
 
 			/*
 			 * If the allocation fails, allow OOM handling access
@@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
-			int migratetype,
-			unsigned int alloc_flags,
-			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags,
+			struct per_cpu_pages *pcp,
+			struct list_head *list)
 {
 	struct page *page;
 
@@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		int batch = nr_pcp_alloc(pcp, zone, order);
 		int alloced;
 
-		alloced = rmqueue_bulk(zone, order,
+		alloced = rmqueue_bulk(zone, order, gfp_flags,
 				batch, list,
 				migratetype, alloc_flags);
 
@@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
-			struct zone *zone, unsigned int order,
-			int migratetype, unsigned int alloc_flags)
+			struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 */
 	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
+				 alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 	if (page) {
@@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       migratetype, alloc_flags);
+				       gfp_flags, migratetype, alloc_flags);
 		if (likely(page))
 			goto out;
 	}
 
-	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
-			     migratetype);
+	page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
+			     alloc_flags, migratetype);
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
-		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+		page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
+					 alloc_flags, pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
-- 
2.42.0
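For what it's worth, here is the same toy model extended into a sweep
(again a userspace sketch with hypothetical numbers; the v6 branch is
modeled as the no-ALLOC_CMA watermark check failing once non-CMA free
drops to the watermark):

#include <stdio.h>

int main(void)
{
	long cma = 1400, wmark = 2000, reserve = 800;	/* MB, hypothetical */
	long noncma;

	for (noncma = 4000; noncma >= 1500; noncma -= 500) {
		long free = noncma + cma;

		/* v6: balance on raw free pages, switch to
		 * unconditional CMA once the watermark check fails */
		int v6 = noncma <= wmark || cma > free / 2;

		/* should_try_cma(): reserves and watermarks subtracted */
		int adj = cma > (free - reserve - wmark) / 2;

		printf("noncma=%4ldM  v6=%d  adjusted=%d\n", noncma, v6, adj);
	}
	return 0;
}

The v6 column stays 0 until non-CMA free has fallen all the way to
the watermark and then jumps to 1 - the knee mentioned above - while
the reserve-aware check starts using CMA while there is still
headroom for kernel allocations.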