References: <20241116091658.1983491-1-chenridong@huaweicloud.com> <20241116091658.1983491-2-chenridong@huaweicloud.com>
From: Chris Li
Date: Tue, 26 Nov 2024 16:08:56 -0800
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the tail of LRU after shrinking
To: Matthew Wilcox
Cc: Barry Song <21cnbao@gmail.com>, Chen Ridong, akpm@linux-foundation.org, mhocko@suse.com, hannes@cmpxchg.org, yosryahmed@google.com, yuzhao@google.com, david@redhat.com, ryan.roberts@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, chenridong@huawei.com, wangweiyang2@huawei.com, xieym_ict@hotmail.com

On Sun, Nov 17, 2024 at 8:22 PM Matthew Wilcox wrote:
>
> On Mon, Nov 18, 2024 at 05:14:14PM +1300, Barry Song wrote:
> > On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox wrote:
> > >
> > > On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> > > > 2. In the shrink_page_list function, if folioN is a THP (2M), it may
> > > >    be split and added to the swap cache folio by folio. After being
> > > >    added to the swap cache, IO is submitted to write the folio back
> > > >    to swap, which is asynchronous. When shrink_page_list is finished,
> > > >    the isolated folios list will be moved back to the head of the
> > > >    inactive LRU. The inactive LRU may then look like this, with 512
> > > >    folios having been moved to the head of the inactive LRU.
> > >
> > > I was hoping that we'd be able to stop splitting the folio when adding
> > > to the swap cache. Ideally, we'd add the whole 2MB and write it back
> > > as a single unit.
> >
> > This is already the case: adding to the swapcache doesn't require
> > splitting THPs, but failing to allocate 2MB of contiguous swap slots
> > will.
>
> Agreed we need to understand why this is happening. As I've said a few
> times now, we need to stop requiring contiguity. Real filesystems don't
> need the contiguity (they become less efficient, but they can scatter a
> single 2MB folio to multiple places).
>
> Maybe Chris has a solution to this in the works?

Hi Matthew and Chenridong,

Sorry for the late reply. I don't have a working solution yet; I just
have some ideas.

One of the big challenges is what to do with the swap cache. Currently,
when a folio is added to the swap cache, it is assumed to occupy a
contiguous range of swap entries. There would be a lot of complexity in
breaking that assumption. To make things worse, the discontiguous swap
entries might belong to different xarrays due to the 64M swap address
space sharding.

One idea is to have a special kind of swap device that does swap entry
redirecting.

For the swap-out path, let's say the real swapfile A is almost full and
we want to allocate a run of 4 swap entries for folio F. If there are
contiguous swap entries in A, the swap allocator just returns the
entries [A9..A12], with A9 as the head swap entry. That is the same as
the normal path we have now.

On the other hand, suppose there are no contiguous swap entries in A,
only the non-contiguous entries A1, A3, A5, A7. Instead, we allocate
from a special redirecting swap device R as R1, R2, R3, R4, with an IO
redirecting array of [R1, A1, A3, A5, A7]. Swap device R is virtual;
there is no real file backing it, so the swap file size on R can grow
or shrink as needed.

In add_to_swap_cache(), we set folio F->swap = R1 and add F into swap
cache S with entries [R1..R4] pointing to folio F. In other words,
S[R1..R4] = F. We also add an additional lookup xarray
L[R1..R4] = [R1, A1, A3, A5, A7]. For the rest of the code, R1 is
passed around as the contiguous swap entry for folio F.

swap_writepage_bdev_async() will recognize R as a special device. It
will look up the xarray at L[R1] to get [R1, A1, A3, A5, A7] and use
that entry list to build the bio with 4 io vectors instead of 1,
filling [A1, A3, A5, A7] into the bio vecs. That is the swap write
path.

For swap-in, the page fault handler gets a fault at address X and looks
up the PTE, which contains swap entry R3. It looks up the swap cache at
S[R3] and gets nothing: folio F is not in the swap cache. It recognizes
that R is a remapping device, so the swap core looks up
L[R3] = [R1, A1, A3, A5, A7]. If we want to swap in an order-2 folio,
we construct swap_read_folio_bdev_async() with the io vector
[A1, A3, A5, A7]. If we just want to swap in a single 4K page, we can
construct the io vector as [A5] alone, given that the run starts at R1
and the fault is on R3. That is the read path.

For simplicity, a lot of detail is omitted from this description.
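
To make the lookup mapping a bit more concrete, here is a minimal
user-space sketch of the redirect table idea. This is not kernel code:
the names (struct redirect, rdev_alloc(), rdev_resolve_one(), and so
on) are made up for illustration, and in the kernel the mapping would
live behind the swap cache and the bio construction in page_io.c
rather than behind malloc() and printf().

/*
 * Minimal user-space model of the redirecting swap device described
 * above. All names here are hypothetical, not existing kernel APIs:
 * "rdev" stands in for the virtual device R, "real" for swapfile A.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_RDEV_SLOTS 64        /* size of the toy virtual device R */

/* One redirect record: contiguous run R..R+nr-1 -> scattered A slots. */
struct redirect {
        unsigned long r_head;   /* head entry on virtual device R */
        int nr;                 /* run length (folio nr_pages)    */
        unsigned long real[8];  /* A slots, one per R entry       */
};

/* Stand-in for the lookup mapping L[R1..R4] = [R1, A1, A3, A5, A7]. */
static struct redirect *rdev_map[NR_RDEV_SLOTS];
static unsigned long rdev_next = 1;

/* Allocate nr contiguous R entries and bind them to scattered A slots. */
static unsigned long rdev_alloc(const unsigned long *real_slots, int nr)
{
        struct redirect *rd = malloc(sizeof(*rd));
        unsigned long r_head = rdev_next;

        rd->r_head = r_head;
        rd->nr = nr;
        for (int i = 0; i < nr; i++) {
                rd->real[i] = real_slots[i];
                rdev_map[r_head + i] = rd;  /* every L[Rk] hits the record */
        }
        rdev_next += nr;
        return r_head;          /* this is what folio->swap would carry */
}

/* Swap-out path: resolve the R run to the scattered A slots for the bio. */
static void rdev_print_write(unsigned long r_head)
{
        struct redirect *rd = rdev_map[r_head];

        printf("write folio @R%lu -> real slots:", r_head);
        for (int i = 0; i < rd->nr; i++)
                printf(" A%lu", rd->real[i]);   /* one bio segment per slot */
        printf("\n");
}

/* Swap-in path: a 4K fault on entry R resolves to exactly one A slot. */
static unsigned long rdev_resolve_one(unsigned long r)
{
        struct redirect *rd = rdev_map[r];

        return rd->real[r - rd->r_head];        /* offset within the R run */
}

int main(void)
{
        /* Swapfile A is fragmented: only A1, A3, A5, A7 are free. */
        unsigned long free_in_a[] = { 1, 3, 5, 7 };
        unsigned long r_head = rdev_alloc(free_in_a, 4);

        rdev_print_write(r_head);               /* whole order-2 folio out */
        printf("4K fault on R%lu -> A%lu\n", r_head + 2,
               rdev_resolve_one(r_head + 2));   /* R3 resolves to A5 */
        return 0;
}

The point is just the shape of the data: a contiguous run of R entries
all referencing one record that carries the scattered A slots, so both
the whole-folio write-out and a single 4K fault resolve with one
lookup.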
Also on the implementation side, there are a lot of optimizations we
might be able to do, e.g. using a pointer lookup from R1 instead of an
xarray; we could use a struct to hold R1 and [A1, A3, A5, A7], etc.

This approach avoids a lot of the complexity of breaking the contiguity
assumption for swap cache entries, at the cost of the additional swap
cache address space R. The lookup mapping
L[R1..R4] = [R1, A1, A3, A5, A7] is the minimal data structure needed
to track the IO remapping; I think that is unavoidable.

Please let me know if you see any problem with the above approach. As
always, feedback is welcome as well.

Thanks

Chris