From: Chris Li <chrisl@kernel.org>
Date: Tue, 18 Jun 2024 02:31:58 -0700
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
In-Reply-To: <8734pa68rl.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying wrote:
>
> Chris Li writes:
>
> > That is in general true with all kernel development regardless of
> > using options or not. If there is a bug in my patch, I will need to
> > debug and fix it or the patch might be reverted.
> >
> > I don't see that as a reason to take the option path or not.
> > The option just means the user taking this option will need to
> > understand the trade off and accept the defined behavior of that
> > option.
>
> User configuration knobs are not forbidden for Linux kernel. But we are
> more careful about them because they will introduce ABI which we need to
> maintain forever. And they are hard to be used for users. Optimizing
> automatically is generally the better solution. So, I suggest you to
> think more about the automatic solution before diving into a new
> option.

I did, see my reply. Right now there are just no other options.

> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
> >> >
> >> > Me too. I prefer transparent over reservation if it can achieve the
> >> > same goal. Do we have a fully transparent method spec out? How to
> >> > achieve fully transparent and also avoid fragmentation caused by mix
> >> > order allocation/free?
> >> >
> >> > Keep in mind that we are still in the early stage of the mTHP swap
> >> > development, I can have the reservation patch relatively easily. If
> >> > you come up with a better transparent method patch which can achieve
> >> > the same goal later, we can use it instead.
> >>
> >> Because we are still in the early stage, I think that we should try to
> >> improve the transparent solution firstly. Personally, what I don't like is
> >> that we don't work on the transparent solution because we have the
> >> reservation solution.
> >
> > Do you have a road map or the design for the transparent solution you
> > can share? I am interested to know what is the short term step (e.g. a
> > month) in this transparent solution you have in mind, so we can compare
> > the different approaches. I can't reason much just by the name
> > "transparent solution" itself. Need more technical details.
> >
> > Right now we have a clear usage case we want to support, the swap
> > in/out of mTHP with bigger zsmalloc buffers.
> > We can start with the limited usage case first then move to more
> > general ones.
>
> TBH, this is what I don't like. It appears that you refuse to think
> about the transparent (or automatic) solution.

Actually, that is not true; you make the wrong assumption about what I have considered. I want to find out what you have in mind so I can compare the near term solutions. In my recent LSF slides I already list 3 options to address this fragmentation problem. From easy to hard:

1) Assign clusters an order on allocation and remember the cluster
   order (short term). That is this patch series.

2) Buddy allocation of the swap entries (longer term).

3) Folio write out as compound discontiguous swap entries (ultimate).

I also considered 4), which I did not put into the slides because it is
less effective than 3):

4) Migrating the swap entries, which requires scanning page table
   entries. I briefly mentioned it during the session.

3) might qualify as your transparent solution. It is just much harder to implement. Even when we have 3), having some form of 1) can be beneficial as well (lower IO count, no indirection layer for the swap offset).

> I haven't thought about them thoroughly, but at least we may think about
>
> - promoting a low order non-full cluster when we find free high order
>   swap entries.
>
> - stealing a low order non-full cluster with low usage count for
>   high-order allocation.

Now we are talking. These two fall well within 2), the buddy allocator. But the buddy allocator will not be able to address all fragmentation issues, because the allocator does not control the life cycle of the swap entries. It will not help Barry's zsmalloc usage case much because Android likes to keep the swapfile full. I can already see that.

> - freeing more swap entries when swap devices become fragmented.

That requires scanning page tables to free the swap entries, basically 4).

It is all about investment and return. 1) is relatively easy to implement and comes with a good improvement and return.
Chris

> >> >> >> that's really important for you, I think that it's better to design
> >> >> >> something like hugetlbfs vs core mm, that is, be separated from the
> >> >> >> normal swap subsystem as much as possible.
> >> >> >
> >> >> > I am giving hugetlbfs just to make the point of using reservation, or
> >> >> > isolation of the resource, to prevent mixing with the fragmentation
> >> >> > existing in core mm.
> >> >> > I am not suggesting copying the hugetlbfs implementation to the swap
> >> >> > system. Unlike hugetlbfs, the swap allocation is typically done from
> >> >> > the kernel; it is transparent to the application. I don't think
> >> >> > separating from the swap subsystem is a good way to go.
> >> >> >
> >> >> > This comes down to why you don't like the reservation. e.g. if we use
> >> >> > two swapfiles, one swapfile purely allocated for high order, would
> >> >> > that be better?
> >> >>
> >> >> Sorry, my words weren't accurate. Personally, I just think that it's
> >> >> better to make reservation related code not too intrusive.
> >> >
> >> > Yes. I will try to make it not too intrusive.
> >> >
> >> >> And, before reservation, we need to consider something else firstly.
> >> >> Whether is it generally good to swap-in with swap-out order? Should we
> >> >
> >> > When we have the reservation patch (or other means to sustain mixed size
> >> > swap allocation/free), we can test it out to get more data to reason
> >> > about it.
> >> > I consider the swap-in size policy an orthogonal issue.
> >>
> >> No. I don't think so. If you swap out in a higher order but swap in at
> >> a lower order, you make the swap clusters fragmented.
> >
> > Sounds like that is the reason to apply swap-in at the same order as the
> > swap-out. In any case, my original point still stands. We need to have
> > the ability to allocate high order swap entries with a reasonable success
> > rate *before* we have the option to choose which size to swap in.
> > If allocating a high order swap always fails, we will be forced to use
> > the low order one; there is no option to choose from. We can't evaluate
> > "is it generally good to swap-in with swap-out order?" by actual runs.
>
> I think we don't need to fight for that. Just prove the value of your
> patchset with reasonable use cases and normal workloads. Data will
> persuade people.