From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 15 Mar 2024 23:01:46 +1300
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
 ryan.roberts@arm.com, chengming.zhou@linux.dev, chrisl@kernel.org,
 david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
 steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
 xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
 Chuanhua Han, Barry Song
In-Reply-To: <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240304081348.197341-1-21cnbao@gmail.com>
 <20240304081348.197341-6-21cnbao@gmail.com>
 <87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Chuanhua Han
> >> >
> >> > On an embedded system like Android, more than half of anon memory is
> >> > actually in swap devices such as zRAM. For example, while an app is
> >> > switched to the background, most of its memory might be swapped out.
> >> >
> >> > Now we have mTHP features; unfortunately, if we don't support large folio
> >> > swap-in, once those large folios are swapped out, we immediately lose the
> >> > performance gain we can get through large folios and hardware optimization
> >> > such as CONT-PTE.
> >> >
> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> > whole.
> >> >
> >> > Meanwhile, the current implementation only covers the SWAP_SYNCHRONOUS
> >> > case. It doesn't support swapin_readahead as large folios yet, since this
> >> > kind of shared memory is much less common than memory mapped by a single
> >> > process.
> >>
> >> In contrast, I still think that it's better to start with the normal swap-in
> >> path, then expand to the SWAP_SYNCHRONOUS case.
> >
> > I'd rather try the reverse direction, as non-sync anon memory is only
> > around 3% on a phone in my observation.
>
> Phone is not the only platform that Linux is running on.

I suppose it's generally true that forked shared anonymous pages only
constitute a small portion of all anonymous pages; the majority of
anonymous pages are within a single process.

I agree phones are not the only platform. But Rome wasn't built in a day.
I can only get started on hardware which I can easily reach and on which
I have enough hardware/test resources. So we may take the first step,
which can be applied on a real product and improve its performance, and
then, step by step, broaden it and make it widely useful to various areas
which I can't reach :-)

So probably we can have a sysfs "enable" entry with default "n", or a
maximum swap-in order as Ryan suggested [1] at the beginning (a rough
sketch of such a cap follows his quote below):

"
So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.
"
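To make that concrete, below is a minimal sketch of such a cap, assuming
a hypothetical "swapin_max_order" tunable (the sysfs plumbing is omitted,
and none of this is the actual patch code); it simply masks off every
order above the limit before we pick a swap-in folio size:

	/*
	 * Hypothetical sketch only: "swapin_max_order" is an assumed
	 * tunable, not an existing kernel interface.
	 */
	static unsigned long swapin_allowable_orders(struct vm_fault *vmf,
						     unsigned int swapin_max_order)
	{
		struct vm_area_struct *vma = vmf->vma;
		unsigned long orders;

		/* orders permitted by the VMA and by the mTHP sysfs policy */
		orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true,
						  true, BIT(PMD_ORDER) - 1);
		orders = thp_vma_suitable_orders(vma, vmf->address, orders);

		/* keep only orders <= swapin_max_order */
		orders &= BIT(swapin_max_order + 1) - 1;

		return orders;
	}

With swapin_max_order == 0 only order-0 survives, i.e. mTHP swap-in is
disabled; with 2 we cap swapped-in folios at 16KiB, matching the examples
discussed further down.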
" > > >> > >> In normal swap-in path, we can take advantage of swap readahead > >> information to determine the swapped-in large folio order. That is, i= f > >> the return value of swapin_nr_pages() > 1, then we can try to allocate > >> and swapin a large folio. > > > > I am not quite sure we still need to depend on this. in do_anon_page, > > we have broken the assumption and allocated a large folio directly. > > I don't think that we have a sophisticated policy to allocate large > folio. Large folio could waste memory for some workloads, so I think > that it's a good idea to allocate large folio always. i agree, but we still have the below check just like do_anon_page() has it, orders =3D thp_vma_allowable_orders(vma, vma->vm_flags, false, true= , true, BIT(PMD_ORDER) - 1); orders =3D thp_vma_suitable_orders(vma, vmf->address, orders); in do_anon_page, we don't worry about the waste so much, the same logic also applies to do_swap_page(). > > Readahead gives us an opportunity to play with the policy. I feel somehow the rules of the game have changed with an upper limit for swap-in size. for example, if the upper limit is 4 order, it limits folio size to 64KiB which is still a proper size for ARM64 whose base page can be 64KiB. on the other hand, while swapping out large folios, we will always compress them as a whole(zsmalloc/zram patch will come in a couple of days), if we choose to decompress a subpage instead of a large folio in do_swap_page(), we might need to decompress nr_pages times. for example, For large folios 16*4KiB, they are saved as a large object in zsmalloc(with the coming patch), if we swap in a small folio, we decompress the large object; next time, we will still need to decompress a large object. so it is more sensible to swap in a large folio if we find those swap entries are contiguous and were allocated by a large folio swap-out. > > > On the other hand, compressing/decompressing large folios as a > > whole rather than doing it one by one can save a large percent of > > CPUs and provide a much lower compression ratio. With a hardware > > accelerator, this is even faster. > > I am not against to support large folio for compressing/decompressing. > > I just suggest to do that later, after we play with normal swap-in. > SWAP_SYCHRONOUS related swap-in code is an optimization based on normal > swap. So, it seems natural to support large folio swap-in for normal > swap-in firstly. I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal" than the swapcache path since it is the majority. and on the other hand, a = lot of modification is required for the swapcache path. in OPPO's code[1], we d= id bring-up both paths, but the swapcache path is much much more complicated than the SYNC path and hasn't really noticeable improvement. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplu= s/sm8650_u_14.0.0_oneplus12 > > > So I'd rather more aggressively get large folios swap-in involved > > than depending on readahead. > > We can take advantage of readahead algorithm in SWAP_SYCHRONOUS > optimization too. The sub-pages that is not accessed by page fault can > be treated as readahead. I think that is a better policy than > allocating large folio always. Considering the zsmalloc optimization, it would be a better choice to always allocate large folios if we find those swap entries are for a swapped-out large foli= o. as decompressing just once, we get all subpages. 
Some hardware accelerators are even able to decompress a large folio with
multiple hardware threads; for example, 16 hardware threads can decompress
the subpages of a large folio at the same time, so it is just as fast as
decompressing one subpage.

For platforms without the above optimizations, a proper upper limit will
help them disable large folio swap-in or decrease its impact. For example,
if the upper limit is order 0, this patchset is effectively removed; if
the upper limit is order 2, it is as if the base page size were 16KiB.

> >>
> >> To do that, we need to track whether the sub-pages are accessed. I
> >> guess we need that information for large file folio readahead too.
> >>
> >> Hi, Matthew,
> >>
> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> folio have been accessed?
> >>
> >> > Right now, we are re-faulting large folios which are still in swapcache
> >> > as a whole. This can effectively decrease the extra loops and early
> >> > exits which we have added in arch_swap_restore() while supporting MTE
> >> > restore for folios rather than pages. On the other hand, it can also
> >> > reduce the number of do_swap_page() calls, as PTEs used to be set one
> >> > by one even when we hit a large folio in the swapcache.
> >> >
> >> > --
>
> Best Regards,
> Huang, Ying

Thanks
Barry