From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Fri, 31 May 2024 20:40:11 +0800
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Chris Li, Andrew Morton, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
In-Reply-To: <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, May 31, 2024 at 10:37 AM Huang, Ying wrote:
>
> Chris Li writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying wrote:
> >>
> >> Chris Li writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying wrote:
> >> >>
> >> >> Chris Li writes:
> >> >>
> >> >> > I am spinning a new version for this series to
> >> >> > address two issues found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> >        + ci = si->cluster_info + tmp;
> >> >> >    It should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". That is
> >> >> >    a serious bug, but trivial to fix.
> >> >> >
> >> >> > 2) Order 0 allocation currently blindly scans swap_map,
> >> >> >    disregarding the cluster->order.
> >> >>
> >> >> IIUC, now we only scan swap_map[] if list_empty(&si->free_clusters)
> >> >> && list_empty(&si->nonfull_clusters[order]). That is, if you don't
> >> >> run low on free swap space, you will not do that.
> >> >
> >> > You can still have swap space in order 0 clusters while order 4 runs
> >> > out of free_clusters or nonfull_clusters[order]. For Android that is
> >> > a common case.
> >>
> >> When we fail to allocate order 4, we will fall back to order 0, so we
> >> still don't need to scan swap_map[]. But after looking at your reply
> >> below, I realized that the swap space is almost full most of the time
> >> in your cases. Then, it's possible that we run into scanning
> >> swap_map[]: list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true if we put
> >> too many clusters in si->percpu_cluster. So, if we want to avoid
> >> scanning swap_map[], we can stop adding clusters to si->percpu_cluster
> >> when swap space runs low, and maybe take clusters out of
> >> si->percpu_cluster sometimes.
> >
> > One idea after reading your reply: if we run out of
> > nonfull_clusters[order], we should be able to use other CPUs'
> > si->percpu_cluster[] as well. That is a very small win for Android,
> > because Android does not have too many CPUs. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollute higher orders.
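For clarity, the cluster indexing fix described in (1) above can be sketched in plain userspace C. The SWAPFILE_CLUSTER value and the struct layout below are simplified stand-ins, not the kernel definitions:

```c
#include <assert.h>

/* Illustrative value; the kernel's SWAPFILE_CLUSTER differs. */
#define SWAPFILE_CLUSTER 512            /* swap entries per cluster */

struct swap_cluster_info {
    unsigned int order;
};

/* cluster_info[] has one element per *cluster*, while tmp is a
 * per-entry swap offset, so the offset must be scaled down to a
 * cluster index.  The buggy line added tmp directly, indexing far
 * past the intended cluster. */
static struct swap_cluster_info *
offset_to_cluster(struct swap_cluster_info *cluster_info, unsigned long tmp)
{
    return cluster_info + tmp / SWAPFILE_CLUSTER;
}
```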
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you do more testing and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is that nonfull_clusters[order1] cannot be used for
> >> nonfull_clusters[order2]. By definition, we should not fail order 0
> >> allocation; we need to steal nonfull_clusters[order > 0] for order 0
> >> allocation. This can avoid scanning swap_map[] too. This may not be
> >> perfect, but it is the simplest first-step implementation. You can
> >> optimize it further from there.
> >
> > Yes, that is listed as a limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > which order that is: 16K or 64K folios. 4K pages are too small, and 2M
> > pages are too big. The sweet spot might be somewhere in between. If we
> > can support one order well, we can demonstrate the value of mTHP. We
> > can worry about other mixed orders later.
> >
> > Do you have any suggestions for how to prevent order 0 from polluting
> > the higher order clusters? If we allow that to happen, it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is that we don't know how much swap space we should
> > reserve for each order. We can always break higher order clusters into
> > lower orders, but can't do the reverse. The current patch series lets
> > the actual usage determine the percentage of clusters for each order.
> > However, that seems not to be enough for the test case Barry has. When
> > an app gets OOM killed, that is where a large swing of order 0 swap
> > shows up, with not enough higher order usage for that brief moment.
> > The order 0 swap entries will pollute the high order clusters. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order.
> > Those reservations will be guaranteed, and order 0 swap entries can't
> > pollute them even when we run out of swap space. That can make mTHP at
> > least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen; we must keep it optimized. And OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order clusters from being
> > polluted by lower order ones?
>
> If we use high-order page allocation as a reference, we need something
> like compaction to guarantee high-order allocation eventually. But we
> are too far from that.
>
> For a specific configuration, I believe that we can get a reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only allow a limited maximum number of order-0 swap
> entry allocations, can we keep high-order clusters?

Doesn't limiting order-0 allocation break the bottom line that order-0
allocation is the first class citizen and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying suggested, starting with the lowest order and
increasing the order until nonfull_clusters[order] is not empty. That
way, higher orders are better protected, because a direct scan won't
happen unless we run out of both free_clusters and nonfull_clusters.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
  free_clusters, then discard_clusters.
- If the request is order 0, also check nonfull_clusters[i] for every
  order (for (int i = 0; i < SWAP_NR_ORDERS; ++i)) before letting
  scan_swap_map_try_ssd_cluster return false.

A quick test, still using the memtier test, but with the swap device
size decreased from 10G to 8G for higher pressure:
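The order-0 fallback search described above could look roughly like the sketch below. This is a userspace model, not the actual scan_swap_map_try_ssd_cluster changes; the struct and the SWAP_NR_ORDERS value are illustrative stand-ins that only track list emptiness:

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_NR_ORDERS 9   /* illustrative count of supported orders */

/* Minimal stand-in for the per-device cluster lists: only emptiness
 * matters for the fallback decision sketched here. */
struct si_sketch {
    bool free_clusters_empty;
    bool nonfull_empty[SWAP_NR_ORDERS];
};

/* For an order-0 request, walk nonfull_clusters from the lowest order
 * up and return the first non-empty order, or -1 if every list is
 * empty and the allocator would have to fall back to scanning
 * swap_map[].  Starting from the lowest order means high-order
 * clusters are consumed last, i.e. they stay protected the longest. */
static int order0_pick_nonfull(const struct si_sketch *si)
{
    for (int i = 0; i < SWAP_NR_ORDERS; i++)
        if (!si->nonfull_empty[i])
            return i;
    return -1;
}
```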
Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

The high order (256kB) swapout rate is significantly higher, and 512kB
swapout is now possible, which indicates high orders are better
protected. Lower orders are sacrificed, but that seems worth it.
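To put a number on the 256kB improvement: expressed as a fallback rate, fallback / (swpout + fallback), the numbers above go from roughly 95% fallback before to roughly 31% after. A throwaway helper (the function name is mine, not from the kernel):

```c
#include <assert.h>

/* Fallback rate in percent: the fraction of swapout attempts at a
 * given size that had to fall back to smaller orders. */
static double fallback_rate(unsigned long swpout, unsigned long fallback)
{
    return 100.0 * (double)fallback / (double)(swpout + fallback);
}
```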