From: Barry Song <21cnbao@gmail.com>
Date: Thu, 20 Jun 2024 19:25:42 +1200
Subject: Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, shuah@kernel.org, linux-mm@kvack.org,
 ryan.roberts@arm.com, chrisl@kernel.org, david@redhat.com,
 hughd@google.com, kaleshsingh@google.com, kasong@tencent.com,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Barry Song
In-Reply-To: <87jzik2kcq.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240620002648.75204-1-21cnbao@gmail.com>
 <87zfrg2xce.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o77w2nrw.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzik2kcq.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jun 20, 2024 at 6:36 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Thu, Jun 20, 2024 at 5:22 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Thu, Jun 20, 2024 at 1:55 PM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Barry Song
> >> >> >
> >> >> > Both Ryan and Chris have been utilizing the small test program to aid
> >> >> > in debugging and identifying issues with swap entry allocation. While
> >> >> > a real or intricate workload might be more suitable for assessing the
> >> >> > correctness and effectiveness of the swap allocation policy, a small
> >> >> > test program presents a simpler means of understanding the problem and
> >> >> > initially verifying the improvements being made.
> >> >> >
> >> >> > Let's endeavor to integrate it into the self-test suite. Although it
> >> >> > presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >> > expand its capabilities to support multiple sizes and simulate more
> >> >> > complex systems in the future as required.
> >> >>
> >> >> IIUC, this is a performance test program instead of functionality test
> >> >> program.  Does it match the purpose of the kernel selftest?
> >> >
> >> > I have a differing perspective. I maintain that the functionality is
> >> > not functioning as expected. Despite having all the necessary resources
> >> > for allocation, failure persists, indicating a lack of functionality.
> >>
> >> Is there any user visual functionality issue?
> >
> > Definitely not. If a plane can't take off, taking a train and pretending
> > there's no functionality issue isn't a solution.
>
> I always think that performance optimization is great work.  However, it
> is not functionality work.
>
> > I have never assigned blame for any mistakes here. On the contrary,
> > I have 100% appreciation for Ryan's work in at least initiating mTHP
> > swapout w/o being split.
> >
> > It took countless experiments for humans to make airplanes commercially
> > viable, but the person who created the first flying airplane remains the
> > greatest. Similarly, Ryan's efforts, combined with your review of his patch,
> > have enabled us to achieve a better goal here. Without your work, we can't
> > get here at all.
>
> Thanks!
>
> > However, this is never a reason to refuse to acknowledge that this feature
> > is not actually working.
>
> It just works for some workloads, not for some others.
>
> >>
> >> >>
> >> >> > Signed-off-by: Barry Song
> >> >> > ---
> >> >> >  tools/testing/selftests/mm/Makefile           |   1 +
> >> >> >  .../selftests/mm/thp_swap_allocator_test.c    | 192 ++++++++++++++++++
> >> >> >  2 files changed, 193 insertions(+)
> >> >> >  create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> >
> >> >> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> >> >> > index e1aa09ddaa3d..64164ad66835 100644
> >> >> > --- a/tools/testing/selftests/mm/Makefile
> >> >> > +++ b/tools/testing/selftests/mm/Makefile
> >> >> > @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test
> >> >> >  TEST_GEN_FILES += seal_elf
> >> >> >  TEST_GEN_FILES += on-fault-limit
> >> >> >  TEST_GEN_FILES += pagemap_ioctl
> >> >> > +TEST_GEN_FILES += thp_swap_allocator_test
> >> >> >  TEST_GEN_FILES += thuge-gen
> >> >> >  TEST_GEN_FILES += transhuge-stress
> >> >> >  TEST_GEN_FILES += uffd-stress
> >> >> > diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> > new file mode 100644
> >> >> > index 000000000000..4443a906d0f8
> >> >> > --- /dev/null
> >> >> > +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> >> >> > @@ -0,0 +1,192 @@
> >> >> > +// SPDX-License-Identifier: GPL-2.0-or-later
> >> >> > +/*
> >> >> > + * thp_swap_allocator_test
> >> >> > + *
> >> >> > + * The purpose of this test program is helping check if THP swpout
> >> >> > + * can correctly get swap slots to swap out as a whole instead of
> >> >> > + * being split. It randomly releases swap entries through madvise
> >> >> > + * DONTNEED and do swapout on two memory areas: a memory area for
> >> >> > + * 64KB THP and the other area for small folios. The second memory
> >> >> > + * can be enabled by "-s".
> >> >> > + * Before running the program, we need to setup a zRAM or similar
> >> >> > + * swap device by:
> >> >> > + *  echo lzo > /sys/block/zram0/comp_algorithm
> >> >> > + *  echo 64M > /sys/block/zram0/disksize
> >> >> > + *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >> >> > + *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >> >> > + *  mkswap /dev/zram0
> >> >> > + *  swapon /dev/zram0
> >> >> > + * The expected result should be 0% anon swpout fallback ratio w/ or
> >> >> > + * w/o "-s".
> >> >> > + *
> >> >> > + * Author(s): Barry Song
> >> >> > + */
> >> >> > +
> >> >> > +#define _GNU_SOURCE
> >> >> > +#include
> >> >> > +#include
> >> >> > +#include
> >> >> > +#include
> >> >> > +#include
> >> >> > +#include
> >> >> > +#include
> >> >> > +
> >> >> > +#define MEMSIZE_MTHP (60 * 1024 * 1024)
> >> >> > +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> >> >> > +#define ALIGNMENT_MTHP (64 * 1024)
> >> >> > +#define ALIGNMENT_SMALLFOLIO (4 * 1024)
> >> >> > +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> >> >> > +#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024)
> >> >> > +#define MTHP_FOLIO_SIZE (64 * 1024)
> >> >> > +
> >> >> > +#define SWPOUT_PATH \
> >> >> > +     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> >> >> > +#define SWPOUT_FALLBACK_PATH \
> >> >> > +     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
> >> >> > +
> >> >> > +static void *aligned_alloc_mem(size_t size, size_t alignment)
> >> >> > +{
> >> >> > +     void *mem = NULL;
> >> >> > +
> >> >> > +     if (posix_memalign(&mem, alignment, size) != 0) {
> >> >> > +             perror("posix_memalign");
> >> >> > +             return NULL;
> >> >> > +     }
> >> >> > +     return mem;
> >> >> > +}
> >> >> > +
> >> >> > +static void random_madvise_dontneed(void *mem, size_t mem_size,
> >> >> > +             size_t align_size, size_t total_dontneed_size)
> >> >> > +{
> >> >> > +     size_t num_pages = total_dontneed_size / align_size;
> >> >> > +     size_t i;
> >> >> > +     size_t offset;
> >> >> > +     void *addr;
> >> >> > +
> >> >> > +     for (i = 0; i < num_pages; ++i) {
> >> >> > +             offset = (rand() % (mem_size / align_size)) * align_size;
> >> >> > +             addr = (char *)mem + offset;
> >> >> > +             if (madvise(addr, align_size, MADV_DONTNEED) != 0)
> >> >> > +                     perror("madvise dontneed");
> >> >>
> >> >> IIUC, this simulates align_size (generally 64KB) swap-in.  That is, it
> >> >> simulate the effect of large size swap-in when it's not available in
> >> >> kernel.  If we have large size swap-in in kernel in the future, this
> >> >> becomes unnecessary.
> >> >>
> >> >> Additionally, we have not reached the consensus that we should always
> >> >> swap-in with swapped-out size.  So, I suspect that this test may not
> >> >> reflect real situation in the future.  Although it doesn't reflect
> >> >> current situation too.
> >> >
> >> > Disagree again. releasing the whole mTHP swaps is the best case. Even in
> >> > the best-case scenario, if we fail, it raises concerns for handling
> >> > potentially more challenging situations.
> >>
> >> Repeating sequential anonymous pages writing is the best case.
> >
> > I define the best case as the scenario with the least chance of creating
> > fragments within swapfiles for mTHP to swap out. There is no real
> > difference whether this is done through swapin or madv_dontneed.
>
> IMO, swapin is much more important than madv_dontneed.  Because most
> users use swapin automatically, but few use madv_dontneed by hand.  So,
> I think swapin/swapout test is much more important than madv_dontneed.
> I don't like this test case because madv_dontneed isn't typical or
> basic.

Disliking DONTNEED isn't a sufficient reason to reject this test program,
because no single small program can currently report swapout counters,
swapout fallback counters, and fallback ratios for 100 iterations within
several minutes. That's precisely why we need it, at least initially. We
can enhance it further if it lacks functionality that people want.

The entire purpose of MADV_DONTNEED is to simulate a scenario where all
slots of an mTHP are released as a whole. That avoids creating fragments,
which is the most favorable case for swap allocation. I believe there is
no difference between using MADV_DONTNEED and swapin for this purpose.

But I am perfectly fine with switching from MADV_DONTNEED to swapin in v2.
For the initial commit, I will simply replace DONTNEED with swapping in all
16 subpages every time, as I expect this approach to yield the best test
results.

I anticipate that the optimization process will comprise three steps in
total.

1. If our swapin process doesn't generate fragments (always swapping in
all subpages), we achieve a 0% fallback ratio with Chris's and Ryan's
current optimizations.

2. With the current optimizations from Chris and Ryan, we achieve a
fallback ratio of less than 50% when swapin generates fragments by
randomly swapping in a portion of the subpages.

The positive outcome is that we tested Ryan's V1 on an actual phone that
swaps in by small folios 50% of the time (because we have a 50% chance of
falling back while allocating mTHP within do_swap_page()). Despite this,
we still achieved a 0% fallback ratio when using two zRAMs: one for small
folios and the other for large folios.

My assumption is that anonymous memory still maintains good spatial
locality, so all subpages are eventually accessed even though they are
swapped in one by one. The fragments are therefore removed sooner or
later.

3. We still maintain a 0% fallback ratio with Chris's long-term plan to
optimize swapout, even when using non-contiguous slots. I actually don't
find this difficult if we can save a swap offset in a subpage's field. But
obviously people don't like this, because the trend is to remove subpage's
stuff as much as possible :-)

>
> >>
> >> > I don't find it hard to incorporate additional features into this test
> >> > program to simulate more intricate scenarios.
> >>
> >> IMHO, we don't really need this special purpose test.  We can have some
> >> more general basic tests, for example, sequential anonymous pages
> >> writing/reading, random anonymous pages writing/reading, and combination
> >> of them.
> >
> > I understand that not all things will be loved by all people. However, before
> > I sent this patch, Chris mentioned that it has been very helpful for him and
> > strongly suggested that I contribute it to the self-test suite.
> >
> > By the way, adding sequential and random anonymous pages for
> > read/write operations is definitely in my plan. The absence of this feature
> > isn't a convincing reason to disregard it.
> >
> > [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
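
For readers following the thread, here is a minimal sketch (not the code
from the patch, and not the planned v2) of the two mechanics discussed
above: deriving a fallback ratio from the 64KB swpout and swpout_fallback
counters, and swapping a whole 64KB area back in by touching each of its
16 base pages. The two sysfs paths are the ones defined in the quoted
patch. The helper names read_counter(), fallback_ratio_percent() and
swapin_whole_folio() are hypothetical, and the ratio formula
fallback / (swpout + fallback) is an assumption about what "fallback
ratio" means here.

```c
#include <stddef.h>
#include <stdio.h>

/* Sysfs counters for 64KB mTHP swapout, as named in the quoted patch. */
#define SWPOUT_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
#define SWPOUT_FALLBACK_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"

/* Hypothetical helper: read one integer counter; returns -1 on failure. */
static long long read_counter(const char *path)
{
	FILE *f = fopen(path, "r");
	long long value;

	if (!f)
		return -1;
	if (fscanf(f, "%lld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Assumed definition of the fallback ratio: the share of 64KB swapout
 * attempts that had to be split, as a percentage. The arguments are the
 * counter deltas observed over one test iteration.
 */
static double fallback_ratio_percent(long long swpout, long long fallback)
{
	long long attempts = swpout + fallback;

	if (attempts <= 0)
		return 0.0;
	return 100.0 * (double)fallback / (double)attempts;
}

/*
 * Hypothetical helper: read one byte from every base page of a folio-sized
 * area so that each swapped-out subpage is faulted back in. Returns the
 * number of pages touched (16 for a 64KB area with 4KB pages).
 */
static size_t swapin_whole_folio(const volatile char *folio,
				 size_t folio_size, size_t page_size)
{
	size_t touched = 0;
	size_t off;

	if (page_size == 0)
		return 0;
	for (off = 0; off < folio_size; off += page_size) {
		(void)folio[off];	/* volatile read forces the access */
		touched++;
	}
	return touched;
}
```

A test loop would sample read_counter(SWPOUT_PATH) and
read_counter(SWPOUT_FALLBACK_PATH) before and after each swapout pass and
feed the deltas to fallback_ratio_percent(); 0% is the expected result in
the no-fragment case (step 1 above).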