From: Chris Li <chrisl@kernel.org>
Date: Wed, 5 Jun 2024 00:08:12 -0700
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
In-Reply-To: <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org> <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com> <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com> <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Thu, May 30, 2024 at 7:37 PM Huang, Ying wrote:
>
> Chris Li writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying wrote:
> >
> > because android does not have too many CPUs. We are talking about a
> > handful of clusters, which might not justify the code complexity.
> > It does not change the behavior that order 0 can pollute higher
> > orders.
>
> I have a feeling that you don't really know why swap_map[] is
> scanned. I suggest you do more tests and tracing to find out the
> reason.
> I suspect that there are some non-full cluster collection issues.

swap_map[] is scanned because we run out of non-full clusters. This
can happen because Android tries to make full use of the swapfile.
However, once the swap_map[] scan happens, the non-full cluster is
polluted.

I currently don't have a local reproduction of the issue Barry
reported. However, here is one data point: with two swap files, one
used for high-order allocation only, this patch shows no fallback at
all. If there were a non-full cluster collection issue, we should see
the fallback in this case as well. BTW, with the same setup but
without this patch series, the high-order allocation does fall back.

> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2]. By definition, we should not fail order 0
> >> allocation; we need to steal nonfull_cluster[order>0] for order 0
> >> allocation. This can avoid scanning swap_map[] too. This may not
> >> be perfect, but it is the simplest first-step implementation. You
> >> can optimize based on it further.
> >
> > Yes, that is listed as a limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > which order that is: 16K or 64K folios. 4K pages are too small, 2M
> > pages are too big. The sweet spot might be somewhere in between.
> > If we can support one order well, we can demonstrate the value of
> > mTHP. We can worry about other mixed orders later.
> >
> > Do you have any suggestions for how to prevent order 0 polluting
> > the higher-order clusters? If we allow that to happen, it defeats
> > the goal of being able to allocate higher-order swap entries. The
> > tricky question is that we don't know how much swap space we should
> > reserve for each order. We can always break higher-order clusters
> > into lower orders, but we can't do the reverse. The current patch
> > series lets the actual usage determine the percentage of the
> > clusters used for each order.
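To make the pollution mechanism concrete, here is a toy model. This is
not kernel code: the cluster geometry and all names (toy_cluster,
toy_alloc, and so on) are invented for illustration, with a made-up
8-slot cluster standing in for the real layout.

```c
/*
 * Toy model, not kernel code: illustrates how stray order-0 entries
 * can leave a mostly-free cluster unable to serve any higher-order
 * allocation. All sizes and names are invented.
 */
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 8                   /* made-up slots per cluster */

struct toy_cluster {
	unsigned char map[CLUSTER_SLOTS]; /* 0 = free, else order + 1 */
};

/* Allocate 1 << order naturally aligned slots; return slot or -1. */
static int toy_alloc(struct toy_cluster *c, int order)
{
	int n = 1 << order;

	for (int i = 0; i + n <= CLUSTER_SLOTS; i += n) {
		int busy = 0;

		for (int j = i; j < i + n; j++)
			busy |= c->map[j];
		if (!busy) {
			memset(&c->map[i], order + 1, n);
			return i;
		}
	}
	return -1;
}

/* Order-0 fallback scan: grab the first free slot, wherever it is. */
static int toy_scan_order0(struct toy_cluster *c)
{
	for (int i = 0; i < CLUSTER_SLOTS; i++) {
		if (!c->map[i]) {
			c->map[i] = 1;
			return i;
		}
	}
	return -1;
}

/* Check whether an order-N allocation could still succeed. */
static int toy_can_alloc(struct toy_cluster *c, int order)
{
	struct toy_cluster copy = *c;

	return toy_alloc(&copy, order) >= 0;
}
```

In this toy, two order-0 entries that land at slots 0 and 4 block both
aligned order-2 positions, so the cluster can never hold an order-2
entry again even though six of its eight slots are free. That is the
pollution being discussed: it is caused by placement, not by lack of
free space.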
> > However, that seems not enough for the test case Barry has. When
> > the app gets an OOM kill, that is when a large swing of order 0
> > swapping shows up, with not enough higher-order usage for that
> > brief moment. The order 0 swap entries will pollute the high-order
> > clusters. We are currently debating a "knob" to be able to reserve
> > a certain % of swap space for a certain order. Those reservations
> > will be guaranteed, and order 0 swap entries can't pollute them
> > even when swap space runs out. That can make mTHP at least usable
> > for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first-class
> citizen, and we must keep it optimized. And OOM with free swap space
> isn't acceptable. Please consider the policy we used for page
> allocation.

We need to make both order-0 and high-order allocation work after the
initial pass of allocating from empty clusters. Only order-0
allocation working is not good enough.

On the page allocation side, we have hugetlbfs, which reserves some
memory for high-order pages. We should have something similar that
reserves some high-order swap entries so they cannot be polluted by
low-order ones.

> > Do you see another way to protect the high-order clusters from
> > being polluted by lower-order ones?
>
> If we use high-order page allocation as a reference, we need
> something like compaction to guarantee high-order allocation
> finally. But we are too far from that.

We should consider a reservation for high-order swap entry allocation,
similar to what hugetlbfs does for memory. Swap compaction would be
very complicated, because it needs to scan the PTEs to migrate the
swap entries. It might be easier to support writing a folio out to
discontiguous swap entries; that is another way to address the
fragmentation issue. We are also too far from that right now.

> For specific configurations, I believe that we can get a reasonable
> high-order swap entry allocation success rate for specific use
> cases.
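To sketch what the reservation "knob" could look like, here is a toy
model, again not kernel code: the pool layout and the reserved_hi knob
are hypothetical, loosely modeled on hugetlbfs reserves. Order-0
allocation is refused an empty cluster once claiming it would eat into
the reserve, while higher orders may still use it.

```c
/*
 * Toy sketch, not kernel code: a hypothetical "reserved_hi" knob that
 * keeps some empty clusters out of reach of order-0 allocation, in
 * the spirit of hugetlbfs memory reserves.
 */
#include <assert.h>

#define NR_CLUSTERS 8                 /* made-up cluster count */

struct toy_pool {
	int used[NR_CLUSTERS];        /* 0 = empty, else order + 1 */
	int reserved_hi;              /* knob: clusters order 0 may not take */
};

/* Count clusters that are still empty. */
static int toy_free_clusters(struct toy_pool *p)
{
	int n = 0;

	for (int i = 0; i < NR_CLUSTERS; i++)
		if (!p->used[i])
			n++;
	return n;
}

/*
 * Claim an empty cluster for @order. Order 0 must leave at least
 * reserved_hi empty clusters behind for the higher orders.
 */
static int toy_claim_cluster(struct toy_pool *p, int order)
{
	if (order == 0 && toy_free_clusters(p) <= p->reserved_hi)
		return -1;            /* would eat into the reserve */

	for (int i = 0; i < NR_CLUSTERS; i++) {
		if (!p->used[i]) {
			p->used[i] = order + 1;
			return i;
		}
	}
	return -1;
}
```

With reserved_hi = 2 in an 8-cluster pool, order 0 can claim six
clusters and then starts failing, while high-order allocation can
still claim the two remaining ones. The guarantee is one-sided by
design: order 0 fails early instead of polluting the reserve.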
> For example, if we only allow a limited maximum number of order-0
> swap entry allocations, can we keep high-order clusters?

Yes, we can, by having a knob to reserve some high-order swap space.
Limiting order 0 is equivalent to having some high-order swap entries
reserved. That is a short-term solution.

Chris