From: "Huang, Ying"
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, shuah@kernel.org, linux-mm@kvack.org,
    ryan.roberts@arm.com, chrisl@kernel.org, david@redhat.com,
    hughd@google.com, kaleshsingh@google.com, kasong@tencent.com,
    linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
    Barry Song
Subject: Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout
In-Reply-To: (Barry Song's message of "Thu, 20 Jun 2024 20:11:44 +1200")
References: <20240620002648.75204-1-21cnbao@gmail.com>
    <87zfrg2xce.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87o77w2nrw.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87jzik2kcq.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <877cek2gf9.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 20 Jun 2024 16:26:30 +0800
Message-ID: <8734p82f61.fsf@yhuang6-desk2.ccr.corp.intel.com>

Barry Song <21cnbao@gmail.com> writes:

> On Thu, Jun 20, 2024 at 8:01 PM Huang, Ying wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Thu, Jun 20, 2024 at 6:36 PM Huang, Ying wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Thu, Jun 20, 2024 at 5:22 PM Huang, Ying wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > On Thu, Jun 20, 2024 at 1:55 PM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > From: Barry Song
>> >> >> >> >
>> >> >> >> > Both Ryan and Chris have been utilizing the small test program to aid
>> >> >> >> > in debugging and identifying issues with swap entry allocation. While
>> >> >> >> > a real or intricate workload might be more suitable for assessing the
>> >> >> >> > correctness and effectiveness of the swap allocation policy, a small
>> >> >> >> > test program presents a simpler means of understanding the problem and
>> >> >> >> > initially verifying the improvements being made.
>> >> >> >> >
>> >> >> >> > Let's endeavor to integrate it into the self-test suite. Although it
>> >> >> >> > presently only accommodates 64KB and 4KB, I'm optimistic that we can
>> >> >> >> > expand its capabilities to support multiple sizes and simulate more
>> >> >> >> > complex systems in the future as required.
>> >> >> >>
>> >> >> >> IIUC, this is a performance test program instead of functionality test
>> >> >> >> program. Does it match the purpose of the kernel selftest?
>> >> >> >
>> >> >> > I have a differing perspective. I maintain that the functionality is
>> >> >> > not functioning
>> >> >> > as expected. Despite having all the necessary resources for allocation, failure
>> >> >> > persists, indicating a lack of functionality.
>> >> >>
>> >> >> Is there any user visual functionality issue?
>> >> >
>> >> > Definitely not. If a plane can't take off, taking a train and pretending
>> >> > there's no functionality issue isn't a solution.
>> >>
>> >> I always think that performance optimization is great work. However, it
>> >> is not functionality work.
>> >>
>> >> > I have never assigned blame for any mistakes here. On the contrary,
>> >> > I have 100% appreciation for Ryan's work in at least initiating mTHP
>> >> > swapout w/o being split.
>> >> >
>> >> > It took countless experiments for humans to make airplanes commercially
>> >> > viable, but the person who created the first flying airplane remains the
>> >> > greatest. Similarly, Ryan's efforts, combined with your review of his patch,
>> >> > have enabled us to achieve a better goal here. Without your work, we can't
>> >> > get here at all.
>> >>
>> >> Thanks!
>> >>
>> >> > However, this is never a reason to refuse to acknowledge that this feature
>> >> > is not actually working.
>> >>
>> >> It just works for some workloads, not for some others.
>> >>
>> >> >>
>> >> >> >>
>> >> >> >> > Signed-off-by: Barry Song
>> >> >> >> > ---
>> >> >> >> >  tools/testing/selftests/mm/Makefile | 1 +
>> >> >> >> >  .../selftests/mm/thp_swap_allocator_test.c | 192 ++++++++++++++++++
>> >> >> >> >  2 files changed, 193 insertions(+)
>> >> >> >> >  create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c
>> >> >> >> >
>> >> >> >> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
>> >> >> >> > index e1aa09ddaa3d..64164ad66835 100644
>> >> >> >> > --- a/tools/testing/selftests/mm/Makefile
>> >> >> >> > +++ b/tools/testing/selftests/mm/Makefile
>> >> >> >> > @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test
>> >> >> >> >  TEST_GEN_FILES += seal_elf
>> >> >> >> >  TEST_GEN_FILES += on-fault-limit
>> >> >> >> >  TEST_GEN_FILES += pagemap_ioctl
>> >> >> >> > +TEST_GEN_FILES += thp_swap_allocator_test
>> >> >> >> >  TEST_GEN_FILES += thuge-gen
>> >> >> >> >  TEST_GEN_FILES += transhuge-stress
>> >> >> >> >  TEST_GEN_FILES += uffd-stress
>> >> >> >> > diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c
>> >> >> >> > new file mode 100644
>> >> >> >> > index 000000000000..4443a906d0f8
>> >> >> >> > --- /dev/null
>> >> >> >> > +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
>> >> >> >> > @@ -0,0 +1,192 @@
>> >> >> >> > +// SPDX-License-Identifier: GPL-2.0-or-later
>> >> >> >> > +/*
>> >> >> >> > + * thp_swap_allocator_test
>> >> >> >> > + *
>> >> >> >> > + * The purpose of this test program is helping check if THP swpout
>> >> >> >> > + * can correctly get swap slots to swap out as a whole instead of
>> >> >> >> > + * being split. It randomly releases swap entries through madvise
>> >> >> >> > + * DONTNEED and do swapout on two memory areas: a memory area for
>> >> >> >> > + * 64KB THP and the other area for small folios. The second memory
>> >> >> >> > + * can be enabled by "-s".
>> >> >> >> > + * Before running the program, we need to setup a zRAM or similar
>> >> >> >> > + * swap device by:
>> >> >> >> > + * echo lzo > /sys/block/zram0/comp_algorithm
>> >> >> >> > + * echo 64M > /sys/block/zram0/disksize
>> >> >> >> > + * echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> >> >> >> > + * echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>> >> >> >> > + * mkswap /dev/zram0
>> >> >> >> > + * swapon /dev/zram0
>> >> >> >> > + * The expected result should be 0% anon swpout fallback ratio w/ or
>> >> >> >> > + * w/o "-s".
>> >> >> >> > + *
>> >> >> >> > + * Author(s): Barry Song
>> >> >> >> > + */
>> >> >> >> > +
>> >> >> >> > +#define _GNU_SOURCE
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +#include
>> >> >> >> > +
>> >> >> >> > +#define MEMSIZE_MTHP (60 * 1024 * 1024)
>> >> >> >> > +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
>> >> >> >> > +#define ALIGNMENT_MTHP (64 * 1024)
>> >> >> >> > +#define ALIGNMENT_SMALLFOLIO (4 * 1024)
>> >> >> >> > +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
>> >> >> >> > +#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024)
>> >> >> >> > +#define MTHP_FOLIO_SIZE (64 * 1024)
>> >> >> >> > +
>> >> >> >> > +#define SWPOUT_PATH \
>> >> >> >> > +	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
>> >> >> >> > +#define SWPOUT_FALLBACK_PATH \
>> >> >> >> > +	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
>> >> >> >> > +
>> >> >> >> > +static void *aligned_alloc_mem(size_t size, size_t alignment)
>> >> >> >> > +{
>> >> >> >> > +	void *mem = NULL;
>> >> >> >> > +
>> >> >> >> > +	if (posix_memalign(&mem, alignment, size) != 0) {
>> >> >> >> > +		perror("posix_memalign");
>> >> >> >> > +		return NULL;
>> >> >> >> > +	}
>> >> >> >> > +	return mem;
>> >> >> >> > +}
>> >> >> >> > +
>> >> >> >> > +static void random_madvise_dontneed(void *mem, size_t mem_size,
>> >> >> >> > +		size_t align_size, size_t total_dontneed_size)
>> >> >> >> > +{
>> >> >> >> > +	size_t num_pages = total_dontneed_size / align_size;
>> >> >> >> > +	size_t i;
>> >> >> >> > +	size_t offset;
>> >> >> >> > +	void *addr;
>> >> >> >> > +
>> >> >> >> > +	for (i = 0; i < num_pages; ++i) {
>> >> >> >> > +		offset = (rand() % (mem_size / align_size)) * align_size;
>> >> >> >> > +		addr = (char *)mem + offset;
>> >> >> >> > +		if (madvise(addr, align_size, MADV_DONTNEED) != 0)
>> >> >> >> > +			perror("madvise dontneed");
>> >> >> >>
>> >> >> >> IIUC, this simulates align_size (generally 64KB) swap-in. That is, it
>> >> >> >> simulate the effect of large size swap-in when it's not available in
>> >> >> >> kernel. If we have large size swap-in in kernel in the future, this
>> >> >> >> becomes unnecessary.
>> >> >> >>
>> >> >> >> Additionally, we have not reached the consensus that we should always
>> >> >> >> swap-in with swapped-out size. So, I suspect that this test may not
>> >> >> >> reflect real situation in the future. Although it doesn't reflect
>> >> >> >> current situation too.
>> >> >> >
>> >> >> > Disagree again. releasing the whole mTHP swaps is the best case. Even in
>> >> >> > the best-case scenario, if we fail, it raises concerns for handling potentially
>> >> >> > more challenging situations.
>> >> >>
>> >> >> Repeating sequential anonymous pages writing is the best case.
>> >> >
>> >> > I define the best case as the scenario with the least chance of creating
>> >> > fragments within swapfiles for mTHP to swap out. There is no real
>> >> > difference whether this is done through swapin or madv_dontneed.
>> >>
>> >> IMO, swapin is much more important than madv_dontneed. Because most
>> >> users use swapin automatically, but few use madv_dontneed by hand. So,
>> >> I think swapin/swapout test is much more important than madv_dontneed.
>> >> I don't like this test case because madv_dontneed isn't typical or
>> >> basic.
>> >
>> > Disliking DONTNEED isn't a sufficient reason to reject this test program because
>> > no single small program can report swapout counters, swapout fallback counters,
>> > and fallback ratios within several minutes for 100 iterations. That's
>> > precisely why
>> > we need it, at least initially. We can enhance it further if it lacks
>> > certain functionalities
>> > that people desire.
>> >
>> > The entire purpose of MADV_DONTNEED is to simulate a scenario where all
>> > slots are released as a whole, preventing the creation of fragments, which is
>> > most favorable for swap allocation. I believe there is no difference between
>> > using MADV_DONTNEED or swapin for this purpose. But I am perfectly fine
>> > with switching to swapin to replace MADV_DONTNEED in v2.
>>
>> Great! Thanks for doing this!
>>
>> And even better, can we not make swap-in address aligned and size
>> aligned? It's too unrealistic. It's good to consider some level of
>> spatial locality, for example, swap-in random number of pages
>> sequentially at some random addresses. That could be a good general
>> test program. We can use it to evaluate further swap optimizations, for
>> example, to evaluate the memory wastage of some swap-in size policy.
>
> I wholeheartedly agree with everything mentioned above; these are
> actually part of my plan as incremental patches. This initial commit
> serves as the first step of the three I proposed in the last email.

A small test program is enough to implement all of these. There is no
need to use 3 steps.

IMHO, it's not good to optimize for an unrealistic test case with
address-aligned and size-aligned swap-in. It's trivial to remove the
alignment requirements.

>> And, we don't need PAGEOUT too, just use large virtual address space in
>> test programs. We can trigger swapout in more common way.
>
> I'm not particularly enthusiastic about this idea, as I expect the test program
> to run quickly. A large virtual address space would result in long waiting times
> for the test results, as it relies on vmscan. Therefore, I hope we can use real
> workloads to achieve this instead.

I have used a test program with a large virtual address space (in
vm-scalability) to do swap tests before. It runs really fast. Please
give it a try.

--
Best Regards,
Huang, Ying
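
For illustration only, here is a minimal sketch of the access pattern
suggested above: a random number of pages touched sequentially, starting
at a random page-granular address, with no alignment to the mTHP size.
The helper name random_sequential_touch() and its parameters are
hypothetical; they are not taken from the patch or from vm-scalability.
The helper only reads memory, so a touch results in swap-in only if the
page was swapped out beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (an assumption, not part of the posted patch):
 * touch up to total_pages pages of mem in sequential runs.  Each run
 * starts at a random page index and covers 1..max_run_pages pages, so
 * runs are neither address aligned nor size aligned to any folio size.
 * Returns the number of page touches performed.
 */
static size_t random_sequential_touch(volatile char *mem, size_t mem_size,
		size_t page_size, size_t max_run_pages, size_t total_pages)
{
	size_t nr_pages;
	size_t touched = 0;

	assert(mem != NULL && page_size > 0);

	nr_pages = mem_size / page_size;
	if (nr_pages == 0 || max_run_pages == 0)
		return 0;

	while (touched < total_pages) {
		size_t start = (size_t)rand() % nr_pages;
		size_t run = 1 + (size_t)rand() % max_run_pages;
		size_t i;

		/* Clamp the run to the end of the area and to the budget left. */
		if (run > nr_pages - start)
			run = nr_pages - start;
		if (run > total_pages - touched)
			run = total_pages - touched;

		/* A read through the volatile pointer faults each page in. */
		for (i = 0; i < run; i++)
			(void)mem[(start + i) * page_size];

		touched += run;
	}
	return touched;
}
```

In a test along the lines discussed in this thread, the area would first
be pushed out to swap, and the swpout and swpout_fallback counters would
be compared before and after; that part is left out of the sketch.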