From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Tue, 18 Jun 2024 02:31:58 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
	<87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87a5jp6xuo.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<8734pa68rl.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 19 Jun 2024 17:21:56 +0800
Message-ID: <87h6dp479n.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > That is in general true of all kernel development, regardless of
>> > whether options are used or not. If there is a bug in my patch, I
>> > will need to debug and fix it, or the patch might be reverted.
>> >
>> > I don't see that as a reason to take the option path or not.
>> > The option just means the user taking this option will need to
>> > understand the trade-off and accept the defined behavior of that
>> > option.
>>
>> User configuration knobs are not forbidden in the Linux kernel. But we
>> are more careful about them because they introduce an ABI that we need
>> to maintain forever, and they are hard for users to use. Optimizing
>> automatically is generally the better solution. So, I suggest that you
>> think more about the automatic solution before diving into a new
>> option.
>
> I did, see my reply. Right now there are just no other options.
>
>> >> >>
>> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>> >> >
>> >> > Me too. I prefer transparent over reservation if it can achieve the
>> >> > same goal. Do we have a fully transparent method specced out? How do
>> >> > we achieve full transparency and also avoid the fragmentation caused
>> >> > by mixed-order allocation/free?
>> >> >
>> >> > Keep in mind that we are still in the early stage of mTHP swap
>> >> > development; I can have the reservation patch relatively easily. If
>> >> > you come up with a better transparent method patch which can achieve
>> >> > the same goal later, we can use it instead.
>> >>
>> >> Because we are still in the early stage, I think that we should try to
>> >> improve the transparent solution first. Personally, what I don't like
>> >> is that we don't work on the transparent solution because we have the
>> >> reservation solution.
>> >
>> > Do you have a roadmap or a design for the transparent solution you can
>> > share? I am interested to know what the short-term step (e.g., a month)
>> > of this transparent solution you have in mind is, so we can compare the
>> > different approaches. I can't reason much from the name "transparent
>> > solution" itself; I need more technical details.
>> >
>> > Right now we have a clear use case we want to support: swapping mTHP
>> > in/out with bigger zsmalloc buffers. We can start with the limited use
>> > case first, then move to more general ones.
>>
>> TBH, this is what I don't like. It appears that you refuse to think
>> about the transparent (or automatic) solution.
>
> Actually, that is not true; you make the wrong assumption about what I
> have considered. I want to find out what you have in mind to compare
> the near-term solutions.

Sorry about my wrong assumption.

> In my recent LSF slides I already listed 3 options to address this
> fragmentation problem. From easy to hard:
>
> 1) Assign the cluster an order on allocation and remember the cluster
>    order (short term). That is this patch series.
> 2) Buddy allocation on the swap entries (longer term).
> 3) Folio write-out to compound discontinuous swap entries (ultimate).
>
> I also considered 4), which I did not put into the slides because it
> is less effective than 3):
>
> 4) Migrating the swap entries, which requires scanning page table
>    entries. I briefly mentioned it during the session.

Or you need something like an rmap; that isn't easy.

> 3) might qualify as your transparent solution. It is just much harder
> to implement.
>
> Even when we have 3), having some form of 1) can be beneficial as well
> (lower I/O count, no indirection layer for the swap offset).
>
>>
>> I haven't thought about them thoroughly, but at least we may think about
>>
>> - promoting a low-order non-full cluster when we find free high-order
>>   swap entries in it.
>>
>> - stealing a low-order non-full cluster with a low usage count for
>>   high-order allocation.
>
> Now we are talking.
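To make the stealing idea more concrete, below is a rough, untested
sketch. It is only a model: the struct, the constant, and the function
name are all made up for illustration and are not the real swapfile.c
data structures. The point is just that, when no free cluster is left
for a high-order allocation, we can pick a lightly used lower-order
cluster and carve a naturally aligned run of free slots out of it.

#include <stdbool.h>

#define CLUSTER_SLOTS 512	/* slots per cluster, illustrative only */

/* Simplified stand-in for a swap cluster, not the kernel's struct. */
struct cluster {
	unsigned int order;		/* order this cluster currently serves */
	unsigned int used;		/* number of allocated slots */
	bool slot_used[CLUSTER_SLOTS];
};

/*
 * Try to carve one naturally aligned run of (1 << order) free slots out
 * of a lower-order, lightly used cluster.  Returns the first slot index
 * of the run, or -1 if this cluster is not a suitable victim.
 */
static int steal_from_cluster(struct cluster *c, unsigned int order,
			      unsigned int max_used)
{
	unsigned int run = 1u << order;
	unsigned int base, i;

	/* Only lower-order clusters with few used slots are candidates. */
	if (c->order >= order || c->used > max_used)
		return -1;

	for (base = 0; base + run <= CLUSTER_SLOTS; base += run) {
		for (i = 0; i < run; i++)
			if (c->slot_used[base + i])
				break;
		if (i == run) {
			/* Claim the aligned free run for the high order. */
			for (i = 0; i < run; i++)
				c->slot_used[base + i] = true;
			c->used += run;
			/*
			 * How to record the cluster's order afterwards
			 * (allow mixed orders, or promote the whole
			 * cluster) is the open question.
			 */
			return (int)base;
		}
	}
	return -1;
}

The promotion idea is the mirror image on the free path: when freeing
leaves a low-order non-full cluster with a naturally aligned free range
of the higher order, hand that cluster (or that range) over to the
higher-order list instead of leaving it where it is.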
> These two above fall well within 2), the buddy allocator.
>
> But the buddy allocator will not be able to address all fragmentation
> issues, because the allocator does not control the life cycle of the
> swap entries.
>
> It will not help Barry's zsmalloc use case much, because Android likes
> to keep the swapfile full. I can already see that.

I think that a buddy-like allocator (not exactly the buddy algorithm)
will help with fragmentation. And it will help more users because it
works automatically. I don't think these ideas are too hard to
implement; we can try to find some simple solution first. So, I think
that we don't need to push them to the long term. At least, they can be
done before introducing a high-order cluster reservation ABI. Then, we
can evaluate the benefit and overhead of the reservation ABI.

>> - freeing more swap entries when swap devices become fragmented.
>
> That requires scanning page tables to free the swap entries, basically 4).

No. You can just scan the page table of the current process in
do_swap_page() and try to swap in and free more swap entries. That
doesn't work well for shared pages, but I think it can help quite a few
workloads.

> It is all about investment and return. 1) is relatively easy to
> implement and comes with a good improvement and return.

[snip]

--
Best Regards,
Huang, Ying