From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: (Chris Li's message of "Wed, 5 Jun 2024 00:08:12 -0700")
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
    <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 06 Jun 2024 09:55:24 +0800
Message-ID: <875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Thu, May 30, 2024 at 7:37 PM Huang, Ying wrote:
>>
>> Chris Li writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying wrote:
>> > because Android does not have too many CPUs. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollute higher orders.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you do more testing and tracing to find out the reason. I
>> suspect that there are some non-full cluster collection issues.
>
> swap_map[] is scanned because we run out of non-full clusters. This
> can happen because Android tries to make full use of the swapfile.
> However, once the swap_map[] scan happens, the non-full cluster is
> polluted.
>
> I currently don't have a local reproduction of the issue Barry
> reported. However, here is one data point: two swap files, one for
> high order allocation only, with this patch. No fallback. If there
> were a non-full cluster collection issue, we should see the fallback
> in this case as well.
>
> BTW, with the same setup but without this patch series, it will fall
> back on the high order allocation as well.
>
>> >> Another issue is that nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2]. By definition, we should not fail order 0
>> >> allocation; we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation. This can avoid scanning swap_map[] too. This may not
>> >> be perfect, but it is the simplest first-step implementation. You
>> >> can optimize based on it further.
>> >
>> > Yes, that is listed as a limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > which order that is: 16K or 64K folios. 4K pages are too small, and
>> > 2M pages are too big; the sweet spot might be somewhere in between.
>> > If we can support one order well, we can demonstrate the value of
>> > mTHP. We can worry about other mixed orders later.
>> >
>> > Do you have any suggestions for how to prevent order 0 from
>> > polluting the higher order clusters? If we allow that to happen, it
>> > defeats the goal of being able to allocate higher order swap
>> > entries. The tricky question is that we don't know how much swap
>> > space we should reserve for each order. We can always break higher
>> > order clusters into lower orders, but can't do the reserves.
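[Editorial sketch: the stealing fallback Ying describes above, modeled in
user space. This is illustrative only, not kernel code; the `struct
cluster`, the `nonfull[]` lists, and the function names are all invented
for the sketch.]

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 4

/* Toy model of a swap cluster: 'order' is the allocation order the
 * cluster currently serves, 'free_slots' how many entries remain. */
struct cluster {
    int order;
    int free_slots;
    struct cluster *next;   /* link in nonfull[order] */
};

/* Per-order lists of clusters that still have free slots. */
static struct cluster *nonfull[NR_ORDERS];

static void push_nonfull(struct cluster *c)
{
    c->next = nonfull[c->order];
    nonfull[c->order] = c;
}

/* Order-0 fallback: prefer a nonfull order-0 cluster; otherwise steal
 * a nonfull cluster of any higher order and repurpose it for order 0,
 * instead of falling back to a slot-by-slot swap_map[] scan. */
static struct cluster *take_order0_cluster(void)
{
    for (int o = 0; o < NR_ORDERS; o++) {
        if (nonfull[o]) {
            struct cluster *c = nonfull[o];

            nonfull[o] = c->next;
            c->order = 0;   /* now serves order-0 entries */
            return c;
        }
    }
    return NULL;            /* caller would have to scan swap_map[] */
}
```

The trade-off under discussion still shows in the model: once a higher
order cluster is repurposed this way, it is lost to higher order
allocations until it becomes completely free again.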
>> > The current patch series lets the actual usage determine the
>> > percentage of the cluster space used for each order. However, that
>> > seems not to be enough for the test case Barry has. When the app
>> > gets OOM killed, that is when a large swing of order 0 swap shows
>> > up, with not enough higher order usage for that brief moment. The
>> > order 0 swap entries will pollute the high order clusters. We are
>> > currently debating a "knob" to be able to reserve a certain % of
>> > swap space for a certain order. Those reservations will be
>> > guaranteed, and order 0 swap entries can't pollute them even when
>> > swap space runs out. That can make mTHP at least usable for the
>> > Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first class
>> citizen; we must keep it optimized. And OOM with free swap space
>> isn't acceptable. Please consider the policy we used for page
>> allocation.
>
> We need to make both order-0 and high order allocation work after the
> initial pass of allocating from empty clusters. Having only order-0
> allocation work is not good enough.
>
> On the page allocation side, we have hugetlbfs, which reserves some
> memory for high order pages. We should have something similar that
> allows reserving some high order swap entries without them getting
> polluted by low order ones.

TBH, I don't like the idea of high order swap entry reservation. If
that's really important for you, I think that it's better to design
something like hugetlbfs vs. core mm, that is, separated from the
normal swap subsystem as much as possible.

>> > Do you see another way to protect the high order clusters from
>> > being polluted by lower order ones?
>>
>> If we use high-order page allocation as a reference, we need
>> something like compaction to guarantee high-order allocation
>> eventually. But we are too far from that.
>
> We should consider reservation for high-order swap entry allocation,
> similar to hugetlbfs for memory.
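[Editorial sketch: one possible behavior for the per-order reservation
"knob" debated above, again as a hypothetical user-space model. The pool
layout, the `reserved[]` numbers, and the stealing policy are assumptions
for illustration; no interface like this has been agreed on in the
thread.]

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 4

/* Hypothetical knob: 'reserved[o]' clusters are guaranteed to order-o
 * allocations; anything above that count is shared excess. */
static int reserved[NR_ORDERS]      = { 0, 0, 4, 2 };
static int free_clusters[NR_ORDERS] = { 10, 0, 4, 2 };

/* May an order-'order' allocation take a free cluster from pool 'from'?
 * A pool's own order may dig into its reservation; any other order
 * (notably order 0) may only take the unreserved excess. */
static bool can_take(int order, int from)
{
    if (free_clusters[from] == 0)
        return false;
    if (order == from)
        return true;
    return free_clusters[from] > reserved[from];
}

/* Returns the pool a cluster was taken from, or -1 on failure. */
static int take_cluster(int order)
{
    if (can_take(order, order)) {
        free_clusters[order]--;
        return order;
    }
    for (int from = NR_ORDERS - 1; from >= 0; from--) {
        if (from != order && can_take(order, from)) {
            free_clusters[from]--;
            return from;
        }
    }
    return -1;
}
```

With these numbers, order 0 can drain its own pool and any excess but
never dips below the order-2 and order-3 reservations, so those orders
keep succeeding under order-0 pressure. The cost is exactly the behavior
Ying objects to: order 0 can fail while reserved clusters sit free.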
> Swap compaction would be very complicated because it needs to scan
> the page tables to migrate the swap entries. It might be easier to
> support writing a folio out to compound discontiguous swap entries.
> That is another way to address the fragmentation issue. We are also
> too far from that right now.

It's not easy to write out compound discontiguous swap entries either.
For example, how would we put such folios in the swap cache?

>> For specific configurations, I believe that we can get a reasonable
>> high-order swap entry allocation success rate for specific use
>> cases. For example, if we only allow a limited maximum number of
>> order-0 swap entry allocations, can we keep high-order clusters?
>
> Yes, we can, by having a knob to reserve some high order swap space.
> Limiting order 0 is the same as having some high order swap entries
> reserved.
>
> That is a short term solution.

--
Best Regards,
Huang, Ying