From: Barry Song <21cnbao@gmail.com>
Date: Thu, 20 Jun 2024 20:11:44 +1200
Subject: Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, shuah@kernel.org, linux-mm@kvack.org,
 ryan.roberts@arm.com, chrisl@kernel.org, david@redhat.com, hughd@google.com,
 kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, Barry Song
References: <20240620002648.75204-1-21cnbao@gmail.com>
 <87zfrg2xce.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o77w2nrw.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzik2kcq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <877cek2gf9.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <877cek2gf9.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jun 20, 2024 at 8:01 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Thu, Jun 20, 2024 at 6:36 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Thu, Jun 20, 2024 at 5:22 PM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Thu, Jun 20, 2024 at 1:55 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Barry Song
> >> >> >> >
> >> >> >> > Both Ryan and Chris have been utilizing the small test program to aid
> >> >> >> > in debugging and identifying issues with swap entry allocation. While
> >> >> >> > a real or intricate workload might be more suitable for assessing the
> >> >> >> > correctness and effectiveness of the swap allocation policy, a small
> >> >> >> > test program presents a simpler means of understanding the problem and
> >> >> >> > initially verifying the improvements being made.
> >> >> >> >
> >> >> >> > Let's endeavor to integrate it into the self-test suite. Although it
> >> >> >> > presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >> >> > expand its capabilities to support multiple sizes and simulate more
> >> >> >> > complex systems in the future as required.
> >> >> >>
> >> >> >> IIUC, this is a performance test program instead of functionality test
> >> >> >> program.  Does it match the purpose of the kernel selftest?
> >> >> >
> >> >> > I have a differing perspective. I maintain that the functionality is
> >> >> > not functioning as expected. Despite having all the necessary resources
> >> >> > for allocation, failure persists, indicating a lack of functionality.
> >> >>
> >> >> Is there any user visual functionality issue?
> >> >
> >> > Definitely not. If a plane can't take off, taking a train and pretending
> >> > there's no functionality issue isn't a solution.
> >>
> >> I always think that performance optimization is great work.  However, it
> >> is not functionality work.
> >>
> >> > I have never assigned blame for any mistakes here. On the contrary,
> >> > I have 100% appreciation for Ryan's work in at least initiating mTHP
> >> > swapout w/o being split.
> >> >
> >> > It took countless experiments for humans to make airplanes commercially
> >> > viable, but the person who created the first flying airplane remains the
> >> > greatest. Similarly, Ryan's efforts, combined with your review of his patch,
> >> > have enabled us to achieve a better goal here. Without your work, we can't
> >> > get here at all.
> >>
> >> Thanks!
> >>
> >> > However, this is never a reason to refuse to acknowledge that this feature
> >> > is not actually working.
> >>
> >> It just works for some workloads, not for some others.
> >>
> >> >>
> >> >> >>
> >> >> >> > Signed-off-by: Barry Song
> >> >> >> > ---
> >> >> >> >  tools/testing/selftests/mm/Makefile           |   1 +
> >> >> >> >  .../selftests/mm/thp_swap_allocator_test.c    | 192 ++++++++++++++++++
> >> >> >> >  2 files changed, 193 insertions(+)
> >> >> >> >  create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> >> >
> >> >> >> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> >> >> >> > index e1aa09ddaa3d..64164ad66835 100644
> >> >> >> > --- a/tools/testing/selftests/mm/Makefile
> >> >> >> > +++ b/tools/testing/selftests/mm/Makefile
> >> >> >> > @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test
> >> >> >> >  TEST_GEN_FILES += seal_elf
> >> >> >> >  TEST_GEN_FILES += on-fault-limit
> >> >> >> >  TEST_GEN_FILES += pagemap_ioctl
> >> >> >> > +TEST_GEN_FILES += thp_swap_allocator_test
> >> >> >> >  TEST_GEN_FILES += thuge-gen
> >> >> >> >  TEST_GEN_FILES += transhuge-stress
> >> >> >> >  TEST_GEN_FILES += uffd-stress
> >> >> >> > diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> >> > new file mode 100644
> >> >> >> > index 000000000000..4443a906d0f8
> >> >> >> > --- /dev/null
> >> >> >> > +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> >> > @@ -0,0 +1,192 @@
> >> >> >> > +// SPDX-License-Identifier: GPL-2.0-or-later
> >> >> >> > +/*
> >> >> >> > + * thp_swap_allocator_test
> >> >> >> > + *
> >> >> >> > + * The purpose of this test program is helping check if THP swpout
> >> >> >> > + * can correctly get swap slots to swap out as a whole instead of
> >> >> >> > + * being split. It randomly releases swap entries through madvise
> >> >> >> > + * DONTNEED and do swapout on two memory areas: a memory area for
> >> >> >> > + * 64KB THP and the other area for small folios. The second memory
> >> >> >> > + * can be enabled by "-s".
> >> >> >> > + * Before running the program, we need to setup a zRAM or similar
> >> >> >> > + * swap device by:
> >> >> >> > + *  echo lzo > /sys/block/zram0/comp_algorithm
> >> >> >> > + *  echo 64M > /sys/block/zram0/disksize
> >> >> >> > + *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >> >> >> > + *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >> >> >> > + *  mkswap /dev/zram0
> >> >> >> > + *  swapon /dev/zram0
> >> >> >> > + * The expected result should be 0% anon swpout fallback ratio w/ or
> >> >> >> > + * w/o "-s".
> >> >> >> > + *
> >> >> >> > + * Author(s): Barry Song
> >> >> >> > + */
> >> >> >> > +
> >> >> >> > +#define _GNU_SOURCE
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +#include
> >> >> >> > +
> >> >> >> > +#define MEMSIZE_MTHP (60 * 1024 * 1024)
> >> >> >> > +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> >> >> >> > +#define ALIGNMENT_MTHP (64 * 1024)
> >> >> >> > +#define ALIGNMENT_SMALLFOLIO (4 * 1024)
> >> >> >> > +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> >> >> >> > +#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024)
> >> >> >> > +#define MTHP_FOLIO_SIZE (64 * 1024)
> >> >> >> > +
> >> >> >> > +#define SWPOUT_PATH \
> >> >> >> > +	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> >> >> >> > +#define SWPOUT_FALLBACK_PATH \
> >> >> >> > +	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
> >> >> >> > +
> >> >> >> > +static void *aligned_alloc_mem(size_t size, size_t alignment)
> >> >> >> > +{
> >> >> >> > +	void *mem = NULL;
> >> >> >> > +
> >> >> >> > +	if (posix_memalign(&mem, alignment, size) != 0) {
> >> >> >> > +		perror("posix_memalign");
> >> >> >> > +		return NULL;
> >> >> >> > +	}
> >> >> >> > +	return mem;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +static void random_madvise_dontneed(void *mem, size_t mem_size,
> >> >> >> > +		size_t align_size, size_t total_dontneed_size)
> >> >> >> > +{
> >> >> >> > +	size_t num_pages = total_dontneed_size / align_size;
> >> >> >> > +	size_t i;
> >> >> >> > +	size_t offset;
> >> >> >> > +	void *addr;
> >> >> >> > +
> >> >> >> > +	for (i = 0; i < num_pages; ++i) {
> >> >> >> > +		offset = (rand() % (mem_size / align_size)) * align_size;
> >> >> >> > +		addr = (char *)mem + offset;
> >> >> >> > +		if (madvise(addr, align_size, MADV_DONTNEED) != 0)
> >> >> >> > +			perror("madvise dontneed");
> >> >> >>
> >> >> >> IIUC, this simulates align_size (generally 64KB) swap-in.  That is, it
> >> >> >> simulate the effect of large size swap-in when it's not available in
> >> >> >> kernel.  If we have large size swap-in in kernel in the future, this
> >> >> >> becomes unnecessary.
> >> >> >>
> >> >> >> Additionally, we have not reached the consensus that we should always
> >> >> >> swap-in with swapped-out size.  So, I suspect that this test may not
> >> >> >> reflect real situation in the future.  Although it doesn't reflect
> >> >> >> current situation too.
> >> >> >
> >> >> > Disagree again. releasing the whole mTHP swaps is the best case. Even in
> >> >> > the best-case scenario, if we fail, it raises concerns for handling
> >> >> > potentially more challenging situations.
> >> >>
> >> >> Repeating sequential anonymous pages writing is the best case.
> >> >
> >> > I define the best case as the scenario with the least chance of creating
> >> > fragments within swapfiles for mTHP to swap out. There is no real
> >> > difference whether this is done through swapin or madv_dontneed.
> >>
> >> IMO, swapin is much more important than madv_dontneed.  Because most
> >> users use swapin automatically, but few use madv_dontneed by hand.  So,
> >> I think swapin/swapout test is much more important than madv_dontneed.
> >> I don't like this test case because madv_dontneed isn't typical or
> >> basic.
> >
> > Disliking DONTNEED isn't a sufficient reason to reject this test program
> > because no single small program can report swapout counters, swapout
> > fallback counters, and fallback ratios within several minutes for 100
> > iterations. That's precisely why we need it, at least initially. We can
> > enhance it further if it lacks certain functionalities that people desire.
> >
> > The entire purpose of MADV_DONTNEED is to simulate a scenario where all
> > slots are released as a whole, preventing the creation of fragments, which is
> > most favorable for swap allocation. I believe there is no difference between
> > using MADV_DONTNEED or swapin for this purpose. But I am perfectly fine
> > with switching to swapin to replace MADV_DONTNEED in v2.
>
> Great!  Thanks for doing this!
>
> And even better, can we not make swap-in address aligned and size
> aligned?  It's too unrealistic.  It's good to consider some level of
> spatial locality, for example, swap-in random number of pages
> sequentially at some random addresses.  That could be a good general
> test program.  We can use it to evaluate further swap optimizations, for
> example, to evaluate the memory wastage of some swap-in size policy.

I wholeheartedly agree with everything mentioned above; these are actually
part of my plan as incremental patches. This initial commit serves as the
first step of the three I proposed in the last email.

>
> And, we don't need PAGEOUT too, just use large virtual address space in
> test programs.  We can trigger swapout in more common way.

I'm not particularly enthusiastic about this idea, as I expect the test
program to run quickly. A large virtual address space would result in long
waiting times for the test results, as it relies on vmscan. Therefore, I hope
we can use real workloads to achieve this instead.

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
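
P.S. To make the unaligned swap-in idea above a bit more concrete, here is a
rough sketch of how a follow-up could pick a random number of pages at a
random, page-granular (not 64KB-aligned) address and read them sequentially.
It is not part of the posted patch: struct touch_range, pick_random_range(),
touch_range_sequential() and PAGE_SIZE_ASSUMED are hypothetical names used
only for illustration, and the 4KB base page size is an assumption. Reading
one byte per page is simply one way to fault the page back in if it has been
swapped out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 4KB base pages; a real test would query sysconf(_SC_PAGESIZE). */
#define PAGE_SIZE_ASSUMED 4096UL

/* Hypothetical helper type: a run of base pages inside the test area. */
struct touch_range {
	size_t offset;	/* byte offset from the start of the area, page aligned */
	size_t npages;	/* number of base pages to touch sequentially */
};

/*
 * Pick a random page-granular start and a random run length in
 * [1, max_run], clamped so the run never leaves the area. The start is
 * deliberately not aligned to the mTHP size.
 */
static struct touch_range pick_random_range(size_t mem_size, size_t max_run)
{
	size_t total_pages = mem_size / PAGE_SIZE_ASSUMED;
	struct touch_range r;
	size_t start;

	assert(total_pages > 0 && max_run > 0);

	start = (size_t)rand() % total_pages;
	r.npages = 1 + (size_t)rand() % max_run;
	if (r.npages > total_pages - start)
		r.npages = total_pages - start;
	r.offset = start * PAGE_SIZE_ASSUMED;
	return r;
}

/*
 * Read one byte from each page of the run, in address order. On a
 * swapped-out page the read faults and the kernel swaps the page in.
 * Returns the number of pages touched.
 */
static size_t touch_range_sequential(const void *mem, struct touch_range r)
{
	const volatile char *p = (const volatile char *)mem + r.offset;
	size_t i;

	for (i = 0; i < r.npages; i++)
		(void)p[i * PAGE_SIZE_ASSUMED];
	return r.npages;
}
```

A loop calling pick_random_range() followed by touch_range_sequential() could
then take the place of the aligned MADV_DONTNEED loop, with max_run as a knob
for how much spatial locality to simulate.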