From: Barry Song <21cnbao@gmail.com>
Date: Fri, 13 Dec 2024 15:56:17 +0800
Subject: Re: [PATCH] mm, compaction: don't use ALLOC_CMA in long term GUP flow
To: yangge1116@126.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, david@redhat.com, baolin.wang@linux.alibaba.com, vbabka@suse.cz, liuzixing@hygon.cn
References: <1734075432-14131-1-git-send-email-yangge1116@126.com>
In-Reply-To: <1734075432-14131-1-git-send-email-yangge1116@126.com>

On Fri, Dec 13, 2024 at 3:37 PM wrote:
>
> From: yangge
>
> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
> in __compaction_suitable()") allow compaction to proceed when free
> pages required for compaction reside in the CMA pageblocks, it's
> possible that __compaction_suitable() always returns true, and in
> some cases, it's not acceptable.
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.

I don't fully understand why each node has 16GB of CMA. As I recall, I
designed the per-NUMA CMA to support devices that are not behind the
IOMMU, such as the IOMMU itself or certain devices that have no IOMMU
and need contiguous memory for DMA. These devices don't seem to require
that much memory.

>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum
> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
> machine memory. Since there is 16G of free CMA memory on the NUMA
> node, watermark for order-0 always be met for compaction, so
> __compaction_suitable() always returns true, even if the node is
> unable to allocate non-CMA memory for the virtual machine.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> To sum up, during long term GUP flow, we should remove ALLOC_CMA
> both in __compaction_suitable() and __isolate_free_page().

What's the outcome after your fix? Will it quickly fall back to remote
NUMA nodes for the pin?
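
To make the watermark argument in the quoted description concrete, here
is a rough userspace model of the order-0 check. It is only a sketch and
not kernel code: struct zone_model, SKETCH_ALLOC_CMA, watermark_ok() and
compaction_suitable() are hypothetical stand-ins, and the real
__zone_watermark_ok() also accounts for lowmem reserves, highatomic
reserves and more. The one behaviour it assumes from the kernel is that
free CMA pages are treated as unusable when ALLOC_CMA is not set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's ALLOC_CMA flag. */
#define SKETCH_ALLOC_CMA 0x1u

/* Simplified per-node view; the field names are illustrative only. */
struct zone_model {
	unsigned long free_pages;     /* all free pages, CMA included */
	unsigned long free_cma_pages; /* free pages in CMA pageblocks */
};

/*
 * Order-0 watermark check: free CMA pages count as usable only when
 * SKETCH_ALLOC_CMA is passed.
 */
static bool watermark_ok(const struct zone_model *z, unsigned long mark,
			 unsigned int alloc_flags)
{
	unsigned long usable = z->free_pages;

	/* Free CMA pages are a subset of all free pages. */
	assert(z->free_cma_pages <= z->free_pages);

	if (!(alloc_flags & SKETCH_ALLOC_CMA))
		usable -= z->free_cma_pages;

	return usable > mark;
}

/*
 * Models the patched decision: a long-term pin must not count CMA
 * pages, every other caller may.
 */
static bool compaction_suitable(const struct zone_model *z,
				unsigned long mark, bool longterm_pin)
{
	return watermark_ok(z, mark, longterm_pin ? 0 : SKETCH_ALLOC_CMA);
}
```

With 4 KiB pages, a node whose only free memory is 16 GiB of CMA has
free_pages == free_cma_pages == 16 * 262144. For any realistic watermark
the check passes when CMA is counted and fails for a long-term pin, which
is the case where the slow path should be able to give up on the local
node.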

>
> Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
> Cc:
> Signed-off-by: yangge
> ---
>  mm/compaction.c | 8 +++++---
>  mm/page_alloc.c | 4 +++-
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..044c2247 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2384,6 +2384,7 @@ static bool __compaction_suitable(struct zone *zone, int order,
>                                   unsigned long wmark_target)
>  {
>         unsigned long watermark;
> +       bool pin;
>         /*
>          * Watermarks for order-0 must be met for compaction to be able to
>          * isolate free pages for migration targets. This means that the
> @@ -2395,14 +2396,15 @@ static bool __compaction_suitable(struct zone *zone, int order,
>          * even if compaction succeeds.
>          * For costly orders, we require low watermark instead of min for
>          * compaction to proceed to increase its chances.
> -        * ALLOC_CMA is used, as pages in CMA pageblocks are considered
> -        * suitable migration targets
> +        * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
> +        * CMA pageblocks are considered suitable migration targets
>          */
>         watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
>                                 low_wmark_pages(zone) : min_wmark_pages(zone);
>         watermark += compact_gap(order);
> +       pin = !!(current->flags & PF_MEMALLOC_PIN);
>         return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> -                                  ALLOC_CMA, wmark_target);
> +                                  pin ? 0 : ALLOC_CMA, wmark_target);
>  }
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dde19db..9a5dfda 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2813,6 +2813,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  {
>         struct zone *zone = page_zone(page);
>         int mt = get_pageblock_migratetype(page);
> +       bool pin;
>
>         if (!is_migrate_isolate(mt)) {
>                 unsigned long watermark;
> @@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
>                  * exists.
>                  */
>                 watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
> -               if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
> +               pin = !!(current->flags & PF_MEMALLOC_PIN);
> +               if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
>                         return 0;
>         }
>
> --
> 2.7.4
>

Thanks
Barry