From: Barry Song <21cnbao@gmail.com>
Date: Mon, 18 Mar 2024 15:41:25 +1300
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
	ryan.roberts@arm.com, chengming.zhou@linux.dev, chrisl@kernel.org,
	david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
	steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
	xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
	Chuanhua Han, Barry Song
In-Reply-To: <87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240304081348.197341-1-21cnbao@gmail.com>
	<20240304081348.197341-6-21cnbao@gmail.com>
	<87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Mar 18, 2024 at 2:54 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Chuanhua Han
> >> >> >
> >> >> > On an embedded system like Android, more than half of anon memory is
> >> >> > actually in swap devices such as zRAM. For example, while an app is
> >> >> > switched to background, most of its memory might be swapped out.
> >> >> >
> >> >> > Now we have mTHP features. Unfortunately, if we don't support large
> >> >> > folio swap-in, once those large folios are swapped out, we
> >> >> > immediately lose the performance gain we can get through large
> >> >> > folios and hardware optimizations such as CONT-PTE.
> >> >> >
> >> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP
> >> >> > swap-in to those contiguous swaps which were likely swapped out from
> >> >> > an mTHP as a whole.
> >> >> >
> >> >> > Meanwhile, the current implementation only covers the
> >> >> > SWAP_SYNCHRONOUS case. It doesn't support swapin_readahead as large
> >> >> > folios yet, since this kind of shared memory is much less common
> >> >> > than memory mapped by a single process.
> >> >>
> >> >> In contrast, I still think that it's better to start with the normal
> >> >> swap-in path, then expand to the SWAP_SYNCHRONOUS case.
> >> >
> >> > I'd rather try the reverse direction, as non-sync anon memory is only
> >> > around 3% on a phone, in my observation.
> >>
> >> Phones are not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a small portion of all anonymous pages. The majority of
> > anonymous pages are within a single process.
>
> Yes. But IIUC, SWP_SYNCHRONOUS_IO is quite limited; it is set only for
> memory-backed swap devices.

SWP_SYNCHRONOUS_IO is the most common case for embedded Linux. Note that
almost all Android/embedded devices use zRAM rather than a disk for swap.

And we can have an upper-limit order, or a new control like
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin,
and set it to 0 by default at first.
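(Illustration only: a minimal sketch of how such a per-size knob could
gate swap-in orders. The swapin_orders_enabled mask and the helper below
are hypothetical, not existing kernel code.)

        /*
         * Hypothetical sketch: each hugepages-<size>kB/swapin sysfs file
         * sets one bit in swapin_orders_enabled; swap-in then only
         * considers orders whose bit is set, so the default mask of 0
         * keeps mTHP swap-in disabled until userspace opts in.
         */
        static unsigned long swapin_orders_enabled;	/* default 0: disabled */

        static inline unsigned long swapin_filter_orders(unsigned long orders)
        {
                return orders & READ_ONCE(swapin_orders_enabled);
        }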
>
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get started on hardware which I can easily reach and
> > have enough hardware/test resources on. So we may take the first step,
> > which can be applied on a real product and improve its performance, and
> > step by step we broaden it and make it widely useful to various areas
> > which I can't reach :-)
>
> We must guarantee the normal swap path runs correctly and has no
> performance regression when developing the SWP_SYNCHRONOUS_IO
> optimization. So we have to put some effort into testing the normal path
> anyway.
>
> > so probably we can have a sysfs "enable" entry with default "n", or a
> > maximum swap-in order as Ryan suggested [1] at the beginning:
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "
> >
> >>
> >> >>
> >> >> In the normal swap-in path, we can take advantage of swap readahead
> >> >> information to determine the swapped-in large folio order. That is,
> >> >> if the return value of swapin_nr_pages() > 1, then we can try to
> >> >> allocate and swap in a large folio.
> >> >
> >> > I am not quite sure we still need to depend on this. In do_anon_page,
> >> > we have broken the assumption and allocated a large folio directly.
> >>
> >> I don't think that we have a sophisticated policy to allocate large
> >> folios. Large folios could waste memory for some workloads, so I don't
> >> think that it's a good idea to always allocate large folios.
> >
> > I agree, but we still have the below check, just like do_anon_page()
> > has:
> >
> >         orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >                                           BIT(PMD_ORDER) - 1);
> >         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >
> > In do_anon_page, we don't worry about the waste so much; the same logic
> > also applies to do_swap_page().
>
> As I said, "readahead" may save us from application/user-specific
> configuration in most cases. It can be a starting point of "using mTHP
> automatically when it helps and not causing many issues".

I'd rather start from the simpler code path and really improve things on
phones & embedded Linux, which our team can actually reach :-)

>
> >>
> >> Readahead gives us an opportunity to play with the policy.
> >
> > I feel somehow the rules of the game have changed with an upper limit
> > for swap-in size. For example, if the upper limit is order 4, it limits
> > folio size to 64KiB, which is still a proper size for ARM64, whose base
> > page can be 64KiB.
> >
> > On the other hand, while swapping out large folios, we will always
> > compress them as a whole (a zsmalloc/zRAM patch will come in a couple
> > of days). If we choose to decompress a subpage instead of a large folio
> > in do_swap_page(), we might need to decompress nr_pages times. For
> > example:
> >
> > Large folios of 16 * 4KiB are saved as one large object in zsmalloc
> > (with the coming patch). If we swap in a small folio, we decompress the
> > large object; next time, we will still need to decompress the large
> > object. So it is more sensible to swap in a large folio if we find
> > those swap entries are contiguous and were allocated by a large-folio
> > swap-out.
>
> I understand that there are some special requirements for ZRAM. But I
> don't think it's a good idea to force the general code to fit the
> requirements of a specific swap device too much. This is one of the
> reasons that I think we should start with normal swap devices, then try
> to optimize for some specific devices.

I agree, but we are having a good start. zRAM is not one specific device;
it widely represents embedded Linux.
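To make the contiguity idea above concrete, the check could look roughly
like the sketch below. This is my illustration, not the patch's exact
code, and swap_entries_contiguous() is a hypothetical name.

        /*
         * Hedged sketch: return true if the nr PTEs starting at ptep hold
         * contiguous swap entries of one swap device, naturally aligned
         * to nr, i.e. they were most likely swapped out together from a
         * single large folio. Assumes ptep points at the first PTE of the
         * aligned range and the page table lock is held.
         */
        static bool swap_entries_contiguous(pte_t *ptep, swp_entry_t entry, int nr)
        {
                unsigned long start = ALIGN_DOWN(swp_offset(entry), nr);
                int i;

                for (i = 0; i < nr; i++) {
                        pte_t pte = ptep_get(ptep + i);
                        swp_entry_t e;

                        if (!is_swap_pte(pte))
                                return false;
                        e = pte_to_swp_entry(pte);
                        if (non_swap_entry(e) ||
                            swp_type(e) != swp_type(entry) ||
                            swp_offset(e) != start + i)
                                return false;
                }
                return true;
        }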
>
> >>
> >> > On the other hand, compressing/decompressing large folios as a whole
> >> > rather than doing it one by one can save a large percentage of CPU
> >> > time and provide a much better compression ratio. With a hardware
> >> > accelerator, this is even faster.
> >>
> >> I am not against supporting large folios for compressing/decompressing.
> >>
> >> I just suggest doing that later, after we play with normal swap-in.
> >> SWAP_SYNCHRONOUS-related swap-in code is an optimization based on
> >> normal swap. So, it seems natural to support large folio swap-in for
> >> normal swap-in first.
> >
> > I feel like SWAP_SYNCHRONOUS is a simpler case and even more "normal"
> > than the swapcache path, since it is the majority.
>
> I don't think so. Most PC and server systems use !SWAP_SYNCHRONOUS swap
> devices.

The problem is that our team is all focusing on phones; we won't have any
resources or bandwidth for PC and server. A more realistic goal is that we
at least let the solution benefit phones and similar embedded Linux first,
and extend it to more areas such as PC and server step by step.

I'd be quite happy if you or other people can join in on PC and server.

>
> > and on the other hand, a lot of modification is required for the
> > swapcache path. In OPPO's code [1], we did bring up both paths, but the
> > swapcache path is much, much more complicated than the SYNC path and
> > hasn't shown really noticeable improvement.
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12
>
> That's great. Please clean up the code and post it to the mailing list.
> Why doesn't it help? IIUC, it can optimize TLB at least.

I agree this can improve things, but most anon pages are single-process
mapped; only quite a few pages go through the readahead code path on
phones. That's why there is no noticeable improvement in the end. I
understand all the benefits you mentioned of changing readahead, but
simply because those kinds of pages are really, really rare, improving
that path doesn't help Android devices much.

>
> >>
> >> > So I'd rather more aggressively get large folio swap-in involved
> >> > than depend on readahead.
> >>
> >> We can take advantage of the readahead algorithm in the
> >> SWAP_SYNCHRONOUS optimization too. The sub-pages that are not accessed
> >> by page fault can be treated as readahead. I think that is a better
> >> policy than always allocating large folios.

This is also true in do_anonymous_page(), but we don't worry about it too
much there, as we have per-size control: the workload has the chance to
set its preferences.

        /*
         * Get a list of all the (large) orders below PMD_ORDER that are enabled
         * for this vma. Then filter out the orders that can't be allocated over
         * the faulting address and still be fully contained in the vma.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
                                          BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
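Combined with a contiguity check like the one sketched earlier, order
selection in the swap-in path could then work roughly as below (my hedged
sketch, modeled on the equivalent loop in alloc_anon_folio();
swap_entries_contiguous() is still a hypothetical name).

        /*
         * Hedged sketch: walk down from the highest enabled order and
         * pick the first order whose naturally aligned run of swap
         * entries is contiguous; order 0 remains the fallback once
         * 'orders' is exhausted. 'pte' is assumed to point at the mapped
         * and locked PTE for vmf->address holding the swap entry 'entry'.
         */
        order = highest_order(orders);
        while (orders) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                ptep = pte - ((vmf->address - addr) >> PAGE_SHIFT);
                if (swap_entries_contiguous(ptep, entry, 1 << order))
                        break;
                order = next_order(&orders, order);
        }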
On the other hand, we are not always allocating large folios; we allocate
large folios only when the swapped-out folio was large. This is quite
important for embedded Linux, as swap happens so often: more than half of
memory can be in swap. If we swap folios out as large folios but swap them
in as small ones, we immediately lose all the advantages, such as fewer
page faults, CONT-PTE, etc.

>
> > Considering the zsmalloc optimization, it would be a better choice to
> > always allocate large folios if we find those swap entries are for a
> > swapped-out large folio, as by decompressing just once we get all the
> > subpages. Some hardware accelerators are even able to decompress a
> > large folio with multiple hardware threads; for example, 16 hardware
> > threads can decompress each subpage of a large folio at the same time,
> > so it is just as fast as decompressing one subpage.
> >
> > For platforms without the above optimizations, a proper upper limit
> > will help them disable large folio swap-in or decrease the impact. For
> > example, if the upper limit is order 0, we are just removing this
> > patchset; if the upper limit is order 2, it is just as if the BASE_PAGE
> > size were 16KiB.
> >
> >>
> >> >>
> >> >> To do that, we need to track whether the sub-pages are accessed. I
> >> >> guess we need that information for large file folio readahead too.
> >> >>
> >> >> Hi, Matthew,
> >> >>
> >> >> Can you help us with tracking whether the sub-pages of a readahead
> >> >> large folio have been accessed?
> >> >>
> >> >> > Right now, we are re-faulting large folios which are still in
> >> >> > swapcache as a whole. This can effectively decrease the extra
> >> >> > loops and early exits which we have introduced in
> >> >> > arch_swap_restore() while supporting MTE restore for folios rather
> >> >> > than pages. On the other hand, it can also decrease do_swap_page()
> >> >> > overhead, as PTEs used to be set one by one even when we hit a
> >> >> > large folio in the swapcache.
> >> >> >
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry