From: Yu Zhao <yuzhao@google.com>
Date: Wed, 23 Oct 2024 22:35:38 -0600
Subject: Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate free highatomic
To: Vlastimil Babka, Mel Gorman
Cc: Michal Hocko, Andrew Morton, David Rientjes, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Link Lin, Matt Fleming
In-Reply-To: <97ccf48e-f30c-4abd-b8ff-2b5310a8b60f@suse.cz>
References: <20241020051315.356103-1-yuzhao@google.com> <82e6d623-bbf3-4dd8-af32-fdfc120fc759@suse.cz> <97ccf48e-f30c-4abd-b8ff-2b5310a8b60f@suse.cz>
On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka wrote:
>
> On 10/23/24 08:36, Yu Zhao wrote:
> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka wrote:
> >>
> >> +Cc Mel and Matt
> >>
> >> On 10/21/24 19:25, Michal Hocko wrote:
> >> > On Mon 21-10-24 11:10:50, Yu Zhao wrote:
> >> >> On Mon, Oct 21, 2024 at 2:13 AM Michal Hocko wrote:
> >> >> >
> >> >> > On Sat 19-10-24 23:13:15, Yu Zhao wrote:
> >> >> > > OOM kills due to vastly overestimated free highatomic reserves were
> >> >> > > observed:
> >> >> > >
> >> >> > >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >> >> > >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >> >> > >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >> >> > >
> >> >> > > The second line above shows that the OOM kill was due to the following
> >> >> > > condition:
> >> >> > >
> >> >> > >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784kB < min (410416kB)
> >> >> > >
> >> >> > > And the third line shows there were no free pages in any
> >> >> > > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> >> >> > > 'H'. Therefore __zone_watermark_unusable_free() overestimated free
> >> >> > > highatomic reserves. IOW, it underestimated the usable free memory by
> >> >> > > over 1GB, which resulted in the unnecessary OOM kill.
> >> >> >
> >> >> > Why doesn't unreserve_highatomic_pageblock deal with this situation?
> >> >>
> >> >> The current behavior of unreserve_highatomic_pageblock() seems WAI to
> >> >> me: it unreserves highatomic pageblocks that contain *free* pages so
> >>
> >> Hm I don't think it's completely WAI. The intention is that we should be
> >> able to unreserve the highatomic pageblocks before going OOM, and there
> >> seems to be an unintended corner case: if the pageblocks are fully
> >> exhausted, they are not reachable for unreserving.
> >
> > I still think unreserving should only apply to highatomic PBs that
> > contain free pages. Otherwise, it seems to me that it'd be
> > self-defeating because:
> > 1. Unreserving fully used highatomic PBs can't fulfill the alloc
> > demand immediately.
>
> I thought the alloc demand is only blocked on the pessimistic watermark
> calculation. Usable free pages exist, but the allocation is not allowed to
> use them.

I think we are talking about two different problems here:
1. The estimation problem.
2. The unreserving policy problem.
What you said here is correct w.r.t. the first problem, and I was
talking about the second problem.
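(To put numbers on the estimation problem: the watermark check treated
all of reserved_highatomic as unusable even though none of it was still
free. A stand-alone C model of the arithmetic from the report above --
illustration only, not kernel code; the constants are the kB figures
from the log:)

	#include <stdio.h>

	/* kB values from the OOM report quoted above. */
	#define FREE_KB            1482936UL
	#define RESERVED_HA_KB     1073152UL /* nr_reserved_highatomic */
	#define ACTUAL_FREE_HA_KB        0UL /* no 'H' blocks in the buddy dump */
	#define MIN_KB              410416UL /* min watermark */

	int main(void)
	{
		/* The estimate used today: assume the whole reserve is still free. */
		unsigned long est = FREE_KB - RESERVED_HA_KB;
		/* What an exact count of free highatomic pages would have produced. */
		unsigned long exact = FREE_KB - ACTUAL_FREE_HA_KB;

		printf("estimated usable: %lukB -> %s\n", est,
		       est < MIN_KB ? "below min, OOM path" : "ok");
		printf("exact usable:     %lukB -> %s\n", exact,
		       exact < MIN_KB ? "below min, OOM path" : "ok");
		return 0;
	}

The estimate lands 632kB below min and triggers the OOM path; the exact
count clears the watermark by over 1GB.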
> > 2. More importantly, it only takes one alloc failure in
> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
> > sense to me that highatomic only unreserves what it doesn't fully use
> > each time unreserve_highatomic_pageblock() is called, not everything
> > it got (except the last PB).
>
> But if the highatomic pageblocks are already full, we are not really
> removing any actual highatomic reserves just by changing the migratetype and
> decreasing nr_reserved_highatomic?

If we change the MT, they can be fragmented a lot faster, i.e., from
the next near-OOM condition to upon becoming free. Trying to persist
over time is what actually makes those PBs more fragmentation
resistant.

> In fact that would allow the reserves to
> grow with some actual free pages in the future.

Good point. I think I can explain it better along this line.

If highatomic is under the limit, both your proposal and the current
implementation would try to grow, so there is not much difference.
However, the current implementation can also reuse previously full
PBs when they become available again. So there is a clear winner here:
the current implementation.

If highatomic has reached the limit, then with your proposal, growth
can only happen after unreserving, and unreserving only happens under
memory pressure. This means it would likely try to grow under memory
pressure, which is more difficult than growing when there is plenty of
memory. The current implementation doesn't try to grow; rather, it
keeps what it already has, betting on those full PBs becoming
available for reuse. So I don't see a clear winner between trying to
grow under memory pressure and betting on reuse.

> > Also not reachable from free_area[] isn't really a big problem. There
> > are ways to solve this without scanning the PB bitmap.
>
> Sure, if we agree it's the way to go.
>
> >> The nr_highatomic is then
> >> also fully misleading as it prevents allocations due to a limit that does
> >> not reflect reality.
> >
> > Right, and the comments warn about this.
>
> Yes, and they explain it's to avoid the cost of searching free lists. Your
> fix introduces that cost, and that's not really great for a watermark-check
> fast path. I'd rather move the cost to highatomic unreserving, which is not
> a fast path.
>
> >> Your patch addresses the second issue, but there's a
> >> cost to it when calculating the watermarks, and it would be better to
> >> address the root issue instead.
> >
> > Theoretically, yes. And I don't think it's actually measurable
> > considering the paths (alloc/reclaim) we are in -- all the data
> > structures this patch accesses should already have been cache-hot, due
> > to unreserve_highatomic_pageblock(), etc.
>
> __zone_watermark_unusable_free() will be executed from every allocation's
> fast path, not only after we recently did
> unreserve_highatomic_pageblock(). AFAICS, as soon as nr_reserved_highatomic
> is over pageblock_nr_pages, we'll unconditionally start counting precisely,
> and the design wanted to avoid this.
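(For readers following the thread: the cost in question is a walk over
every order's MIGRATE_HIGHATOMIC free list on each watermark check. A
toy user-space model of that walk -- the types here are hypothetical
stand-ins; the real code would iterate struct zone's free_area lists
under the zone lock:)

	#include <stdio.h>
	#include <stddef.h>

	#define NR_ORDERS 11 /* orders 0..10, mirroring a typical MAX_PAGE_ORDER */

	/* Toy stand-ins for the buddy free lists (hypothetical types). */
	struct free_chunk { struct free_chunk *next; };
	struct toy_zone   { struct free_chunk *highatomic_free[NR_ORDERS]; };

	/*
	 * Sum the free pages sitting on the highatomic lists. The cost is
	 * O(number of free chunks) per call, which is the objection to doing
	 * this in the allocator fast path rather than in the unreserve path.
	 */
	static unsigned long count_free_highatomic(const struct toy_zone *z)
	{
		unsigned long pages = 0;

		for (int order = 0; order < NR_ORDERS; order++)
			for (struct free_chunk *c = z->highatomic_free[order]; c; c = c->next)
				pages += 1UL << order; /* each chunk holds 2^order pages */

		return pages;
	}

	int main(void)
	{
		struct free_chunk a = { NULL }, b = { &a };
		struct toy_zone z = { .highatomic_free = { [2] = &b } }; /* two order-2 chunks */

		printf("free highatomic pages: %lu\n", count_free_highatomic(&z)); /* 8 */
		return 0;
	}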
> > Also, we have not agreed on the root cause yet.
> >
> >> >> that those pages can become usable to others. There is nothing to
> >> >> unreserve when they have no free pages.
> >>
> >> Yeah there are no actual free pages to unreserve, but unreserving would fix
> >> the nr_highatomic overestimate and thus allow allocations to proceed.
> >
> > Yes, but honestly, I think this is going to cause a regression in
> > highatomic allocs.
>
> I think not, as having a more realistic counter of what's actually reserved
> (and not already used up) can also allow reserving new pageblocks.
>
> >> > I do not follow. How can you have reserved highatomic pages of that size
> >> > without having page blocks with free memory? In other words, is this an
> >> > accounting problem or a reserves problem? This is not really clear from
> >> > your description.
> >>
> >> I think it's the problem of finding the highatomic pageblocks for
> >> unreserving them once they become full. The proper fix is not exactly
> >> trivial though. Either we'll have to scan for highatomic pageblocks in the
> >> pageblock bitmap, or track them using an additional data structure.
> >
> > Assuming we want to unreserve fully used highatomic PBs, we wouldn't
> > need to scan for them or track them. We'd only need to track the delta
> > between how many we want to unreserve (full or not) and how many we
> > are able to. The first page freed in a PB that's highatomic
> > would need to try to reduce the delta by changing the MT.
>
> Hm, that assumes we're adding some checks in the free fast path, and, for
> that to work, that there will be a freed page in a highatomic PB soon enough
> after the decision that we need to unreserve something. Which is not so
> different from the current assumption that we'll find such a free page
> already on the free list immediately.
>
> > To summarize, I think this is an estimation problem, which I would
> > categorize as a lesser problem than an accounting problem. But it sounds
> > to me that you think it's a policy problem, i.e., the highatomic
> > unreserving policy is wrong or not properly implemented?
>
> Yeah, I'd say not properly implemented, but that sounds like a mechanism,
> not a policy problem to me :)

What about adding a new counter to keep track of the size of free
pages reserved for highatomic?

Mel?
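P.S. For concreteness, roughly what such a counter could look like --
a user-space sketch under the assumption that the free-list accounting
paths already know the migratetype of pages entering or leaving the
free lists; the field and function names are illustrative, not the
actual kernel API:

	#include <stdio.h>

	enum migratetype { MT_MOVABLE, MT_HIGHATOMIC };

	/* Toy zone carrying the proposed counter (field names hypothetical). */
	struct toy_zone {
		unsigned long nr_reserved_highatomic; /* existing: reserved capacity */
		unsigned long nr_free_highatomic;     /* proposed: the free part only */
	};

	/*
	 * Hooked wherever pages enter or leave the free lists -- places that
	 * already know the migratetype -- so the watermark check stays O(1).
	 */
	static void account_freepages(struct toy_zone *z, enum migratetype mt,
				      long nr_pages)
	{
		if (mt == MT_HIGHATOMIC)
			z->nr_free_highatomic += nr_pages; /* nr_pages may be negative */
	}

	/* The watermark check then subtracts only what is genuinely unusable. */
	static unsigned long unusable_free(const struct toy_zone *z)
	{
		return z->nr_free_highatomic; /* instead of nr_reserved_highatomic */
	}

	int main(void)
	{
		struct toy_zone z = { .nr_reserved_highatomic = 262144 }; /* 1GB of 4kB pages */

		account_freepages(&z, MT_HIGHATOMIC, 512);  /* a 2MB PB's worth freed */
		account_freepages(&z, MT_HIGHATOMIC, -512); /* ...then allocated again */

		printf("reserved: %lu pages, unusable for watermarks: %lu pages\n",
		       z.nr_reserved_highatomic, unusable_free(&z));
		return 0;
	}

In the scenario from the OOM report, such a counter would read 0 while
nr_reserved_highatomic reads ~1GB, so the watermark check would neither
walk free lists nor overestimate.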