From: "Huang, Ying"
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Tue, 11 Jun 2024 00:11:42 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 13 Jun 2024 16:38:55 +0800
Message-ID: <87a5jp6xuo.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying wrote:
>> >>
>> >> Chris Li writes:
>> >>
>> >> > On the page allocation side, we have hugetlbfs, which reserves some
>> >> > memory for high order pages.
>> >> > We should have something similar to allow reserving some high order
>> >> > swap entries without getting them polluted by low order ones.
>> >>
>> >> TBH, I don't like the idea of high order swap entries reservation.
>> > May I know more about why you don't like the idea? I understand this
>> > can be controversial, because previously we liked to take THP as a
>> > best effort approach. If there is some reason we can't make a THP, we
>> > use order 0 as the fallback.
>> >
>> > For discussion purposes, I want to break it down into smaller steps:
>> >
>> > First, can we agree that the following use case is reasonable:
>> > The use case is that, as Barry has shown, zsmalloc can compress sizes
>> > bigger than 4K with both a better compression ratio and a CPU
>> > performance gain.
>> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>> >
>> > So the goal is to give THP/mTHP a reasonable success rate when running
>> > with mixed size swap allocation, after either low order or high order
>> > swap requests have overflowed the swap file size. The allocator can
>> > still recover from that after some swap entries get freed.
>> >
>> > Please let me know if you think the above use case and goal are not
>> > reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entries allocation. I just think that it's hard to use the
>> reservation based method. For example, how much should be reserved?
>
> Understood, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is that
> we can't do it at all.
>
> Regarding how much we should reserve: similarly, how do you choose your
> swap file size? If you choose N, why not N*120% or N*80%?
> That did not stop us from having a swapfile, right?
>
>> Why does the system OOM when there's still swap space available? And so
>> forth.
>
> Keep in mind that the reservation is an option. If you prefer the old
> behavior, you don't have to use the reservation. That shouldn't be a
> reason to stop others who want to use it. We don't have an alternative
> solution for long running mixed size allocation yet. If there is one, I
> would like to hear it.

It's not enough to make it optional. When you run into an issue, you
need to debug it. And you may debug an issue on a system that is
configured by someone else.

>> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>
> Me too. I prefer transparent over reservation if it can achieve the
> same goal. Do we have a fully transparent method specced out? How do we
> achieve full transparency and also avoid the fragmentation caused by
> mixed order allocation/free?
>
> Keep in mind that we are still in the early stage of the mTHP swap
> development; I can have the reservation patch ready relatively easily.
> If you come up with a better transparent method patch which can achieve
> the same goal later, we can use it instead.

Because we are still in the early stage, I think that we should try to
improve the transparent solution first. Personally, what I don't like is
that we stop working on the transparent solution because we already have
the reservation solution.

>> >>
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I brought up hugetlbfs just to make the point of using reservation, or
>> > isolation of the resource, to prevent the mixed-order fragmentation
>> > that exists in core mm.
>> > I am not suggesting copying the hugetlbfs implementation to the swap
>> > system. Unlike hugetlbfs, swap allocation is typically done from
>> > the kernel; it is transparent to the application. I don't think
>> > separating it from the swap subsystem is a good way to go.
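
[Aside, for illustration only: a minimal user-space toy sketch of the kind
of per-cluster reservation being debated here, i.e. keeping a few free
clusters out of reach of order-0 requests so that high order requests can
still find whole free clusters after a burst of small allocations. All
names below (toy_cluster, NR_CLUSTERS, RESERVED_HIGH, toy_alloc_cluster)
are hypothetical and do not come from swapfile.c or from the patch series.]

/*
 * Toy model only.  Keep RESERVED_HIGH free clusters out of reach of
 * order-0 requests so high order requests still find free clusters.
 */
#include <stdbool.h>
#include <stddef.h>

#define NR_CLUSTERS	64
#define RESERVED_HIGH	8	/* how much to reserve is the open question */

struct toy_cluster {
	bool in_use;
	unsigned int order;	/* order this cluster was handed out for */
};

static struct toy_cluster clusters[NR_CLUSTERS];

static size_t free_clusters(void)
{
	size_t i, n = 0;

	for (i = 0; i < NR_CLUSTERS; i++)
		if (!clusters[i].in_use)
			n++;
	return n;
}

static struct toy_cluster *toy_alloc_cluster(unsigned int order)
{
	size_t i;

	/* Order-0 requests may not dip into the reserved pool. */
	if (order == 0 && free_clusters() <= RESERVED_HIGH)
		return NULL;	/* caller falls back to partially used clusters */

	for (i = 0; i < NR_CLUSTERS; i++) {
		if (!clusters[i].in_use) {
			clusters[i].in_use = true;
			clusters[i].order = order;
			return &clusters[i];
		}
	}
	return NULL;
}

[The sizing of RESERVED_HIGH is exactly the "how much should be reserved?"
question raised above, which is why a transparent scheme is preferred if
one can be found.]
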
>> >
>> > This comes down to why you don't like the reservation. e.g. if we use
>> > two swapfiles, one purely allocated for high order, would that be
>> > better?
>>
>> Sorry, my words weren't accurate. Personally, I just think that it's
>> better to make the reservation related code not too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else first.
>> Is it generally good to swap in with the swap-out order? Should we
>
> When we have the reservation patch (or other means to sustain mixed size
> swap allocation/free), we can test it out to get more data to reason
> about it.
> I consider the swap-in size policy an orthogonal issue.

No, I don't think so. If you swap out in a higher order but swap in in a
lower order, you make the swap clusters fragmented.

>> consider memory wastage too? One static policy doesn't fit all; we may
>> need either a dynamic policy, or to make the policy configurable.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mixed size swap
> allocation/free in the long run. Without that, the swap-in size policy
> is meaningless.
>
> Yes, that is the step by step approach: allowing long running mixed size
> swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high order clusters from
>> >> >> > being polluted by lower order ones?
>> >> >>
>> >> >> If we use high-order page allocation as the reference, we need
>> >> >> something like compaction to guarantee high-order allocation
>> >> >> finally. But we are too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry allocation,
>> >> > similar to hugetlbfs for memory.
>> >> > Swap compaction will be very complicated because it needs to scan the
>> >> > PTEs to migrate the swap entries. It might be easier to support
>> >> > writing folios out to compound discontiguous swap entries. That is
>> >> > another way to address the fragmentation issue. We are also too far
>> >> > from that right now.
>> >>
>> >> It's not easy to write out compound discontiguous swap entries either.
>> >> For example, how do we put folios in the swap cache?
>> >
>> > I proposed the idea in the recent LSF/MM discussion; the last few
>> > slides are about discontiguous swap, and they show discontiguous
>> > entries in the swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agreed, it is not an easy change. The swap cache would have to drop
>> > the assumption that all offsets are contiguous.
>> > For swap, we already have some in-memory data associated with each
>> > offset, so it might provide an opportunity to combine the
>> > offset-related data structures for swap together. Another alternative
>> > might be using the xarray without the multi-entry property, just
>> > treating each offset like a single entry. I haven't dug deep into this
>> > direction yet.
>>
>> Thanks! I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mailing list, then we
> don't need to have a separate meeting :-)

--
Best Regards,
Huang, Ying
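
[Aside, for illustration only: a sketch of the "xarray without the
multi-entry property" direction mentioned above, where a large folio
written to discontiguous swap slots gets one swap cache entry per offset
and lookup stays per offset. swap_cache_xa, toy_add_to_swap_cache and
toy_swap_cache_lookup are hypothetical names; the real swap cache is
organized per swap type and currently assumes a large folio occupies a
contiguous range of offsets.]

/*
 * Sketch: one swap cache entry per swap offset, no contiguity assumed.
 * A folio written to nr discontiguous slots is stored at each of its
 * offsets individually.
 */
#include <linux/xarray.h>
#include <linux/gfp.h>

struct folio;				/* only used via pointers here */

static DEFINE_XARRAY(swap_cache_xa);	/* hypothetical, for the sketch */

static int toy_add_to_swap_cache(struct folio *folio,
				 const unsigned long *offsets, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* Each offset is an independent entry pointing at the folio. */
		void *old = xa_store(&swap_cache_xa, offsets[i], folio,
				     GFP_KERNEL);
		if (xa_is_err(old))
			return xa_err(old);
	}
	return 0;
}

static struct folio *toy_swap_cache_lookup(unsigned long offset)
{
	return xa_load(&swap_cache_xa, offset);
}

[The cost, as noted in the thread, is that the swap cache and the
per-offset metadata can no longer rely on a folio's offsets being
adjacent.]
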