From: Barry Song <21cnbao@gmail.com>
Date: Mon, 17 Jun 2024 19:08:27 +1200
Subject: Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, chrisl@kernel.org, kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, ryan.roberts@arm.com
In-Reply-To: <87bk405akl.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240614195921.a20f1766a78b27339a2a3128@linux-foundation.org> <20240615084714.37499-1-21cnbao@gmail.com> <87bk405akl.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"

On Mon, Jun 17, 2024 at 6:50 PM Huang, Ying wrote:
>
> Hi, Barry,
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton wrote:
> >>
> >> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li wrote:
> >>
> >> > > I'm having trouble understanding the overall impact of this on users.
> >> > > We fail the mTHP swap allocation and fall back, but things continue to
> >> > > operate OK?
> >> >
> >> > Continue to operate OK in the sense that the mTHP will have to split
> >> > into 4K pages before the swap out, aka the fallback. The swap out and
> >> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> >> > fallback, the mTHP based zsmalloc compression with a 64K buffer will not
> >> > happen. That is the effect of the fallback. But mTHP swap out and swap
> >> > in are relatively new, so it is not really a regression.
> >>
> >> Sure, but it's pretty bad to merge a new feature only to have it
> >> ineffective after a few hours' use.
> >>
> >> > >
> >> > > > There are some test numbers in the V1 thread of this series:
> >> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> >> > >
> >> > > Well, please let's get the latest numbers into the latest patchset.
> >> > > Along with a higher-level (and quantitative) description of the user impact.
> >> >
> >> > I will need Barry's help to collect the numbers. I don't have the
> >> > setup to reproduce his test results.
> >> > Maybe a follow-up commit message amendment with the test numbers when I get them?
> >
> > Although the issue may seem complex at a systemic level, even a small program can
> > demonstrate the problem and highlight how Chris's patch has improved the
> > situation.
> >
> > To demonstrate this, I designed a basic test program that allocates up to
> > two memory blocks:
> >
> >  * A memory block of up to 60MB, recommended for HUGEPAGE usage
> >  * A memory block of up to 1MB, recommended for NOHUGEPAGE usage
> >
> > In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> > enough space for both the 60MB and 1MB allocations in the worst case. This setup
> > allows us to assess two effects:
> >
> > 1. When we don't enable mem2 (small folios), we consistently allocate and free
> >    swap slots aligned to 64KB. Is there a risk of failing to obtain swap slots
> >    even though the zRAM has sufficient free space?
> > 2. When we enable mem2 (small folios), the presence of small folios may lead
> >    to fragmentation of clusters, potentially impacting the swapout process for
> >    large folios negatively.
> >
> > (2) can be enabled by "-s"; without -s, small folios are disabled.
> >
> > The script to configure zRAM and mTHP:
> >
> > echo lzo > /sys/block/zram0/comp_algorithm
> > echo 64M > /sys/block/zram0/disksize
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > mkswap /dev/zram0
> > swapon /dev/zram0
> >
> > The test program I made today after receiving Chris' patchset v2
> >
> > (Andrew, please let me know if you want this small test program
> > committed into the kernel/tools/ folder. If so, I will clean it up
> > and prepare a patch):
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <errno.h>
> > #include <time.h>
> > #include <sys/mman.h>
> >
> > #define MEMSIZE_MTHP (60 * 1024 * 1024)
> > #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> > #define ALIGNMENT_MTHP (64 * 1024)
> > #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> > #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> > #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> > #define MTHP_FOLIO_SIZE (64 * 1024)
> >
> > #define SWPOUT_PATH \
> >     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> > #define SWPOUT_FALLBACK_PATH \
> >     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
> >
> > static void *aligned_alloc_mem(size_t size, size_t alignment)
> > {
> >     void *mem = NULL;
> >     if (posix_memalign(&mem, alignment, size) != 0) {
> >         perror("posix_memalign");
> >         return NULL;
> >     }
> >     return mem;
> > }
> >
> > /*
> >  * Discard random align_size-sized chunks with MADV_DONTNEED and
> >  * immediately rewrite them, so they become dirty again and will be
> >  * swapped out on the next MADV_PAGEOUT pass.
> >  */
> > static void random_madvise_dontneed(void *mem, size_t mem_size,
> >                                     size_t align_size, size_t total_dontneed_size)
> > {
> >     size_t num_pages = total_dontneed_size / align_size;
> >     size_t i;
> >     size_t offset;
> >     void *addr;
> >
> >     for (i = 0; i < num_pages; ++i) {
> >         offset = (rand() % (mem_size / align_size)) * align_size;
> >         addr = (char *)mem + offset;
> >         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
> >             perror("madvise dontneed");
> >         }
> >         memset(addr, 0x11, align_size);
> >     }
> > }
> >
> > static unsigned long read_stat(const char *path)
> > {
> >     FILE *file;
> >     unsigned long value;
> >
> >     file = fopen(path, "r");
> >     if (!file) {
> >         perror("fopen");
> >         return 0;
> >     }
> >
> >     if (fscanf(file, "%lu", &value) != 1) {
> >         perror("fscanf");
> >         fclose(file);
> >         return 0;
> >     }
> >
> >     fclose(file);
> >     return value;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >     int use_small_folio = 0;
> >     int i;
> >     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
> >     if (mem1 == NULL) {
> >         fprintf(stderr, "Failed to allocate 60MB memory\n");
> >         return EXIT_FAILURE;
> >     }
> >
> >     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
> >         perror("madvise hugepage for mem1");
> >         free(mem1);
> >         return EXIT_FAILURE;
> >     }
> >
> >     for (i = 1; i < argc; ++i) {
> >         if (strcmp(argv[i], "-s") == 0) {
> >             use_small_folio = 1;
> >         }
> >     }
> >
> >     void *mem2 = NULL;
> >     if (use_small_folio) {
> >         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
> >         if (mem2 == NULL) {
> >             fprintf(stderr, "Failed to allocate 1MB memory\n");
> >             free(mem1);
> >             return EXIT_FAILURE;
> >         }
> >
> >         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
> >             perror("madvise nohugepage for mem2");
> >             free(mem1);
> >             free(mem2);
> >             return EXIT_FAILURE;
> >         }
> >     }
> >
> >     for (i = 0; i < 100; ++i) {
> >         unsigned long initial_swpout;
> >         unsigned long initial_swpout_fallback;
> >         unsigned long final_swpout;
> >         unsigned long final_swpout_fallback;
> >         unsigned long swpout_inc;
> >         unsigned long swpout_fallback_inc;
> >         double fallback_percentage;
> >
> >         /*
> >          * Each iteration: snapshot the 64KB-mTHP swpout counters, dirty
> >          * random chunks of the buffers, force them out with MADV_PAGEOUT,
> >          * then report what fraction of mTHP swapouts fell back to 4KB pages.
> >          */
> >         initial_swpout = read_stat(SWPOUT_PATH);
> >         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> >
> >         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
> >                                 TOTAL_DONTNEED_MTHP);
> >
> >         if (use_small_folio) {
> >             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
> >                                     ALIGNMENT_SMALLFOLIO,
> >                                     TOTAL_DONTNEED_SMALLFOLIO);
> >         }
> >
> >         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
> >             perror("madvise pageout for mem1");
> >             free(mem1);
> >             if (mem2 != NULL) {
> >                 free(mem2);
> >             }
> >             return EXIT_FAILURE;
> >         }
> >
> >         if (use_small_folio) {
> >             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
> >                 perror("madvise pageout for mem2");
> >                 free(mem1);
> >                 free(mem2);
> >                 return EXIT_FAILURE;
> >             }
> >         }
> >
> >         final_swpout = read_stat(SWPOUT_PATH);
> >         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> >
> >         swpout_inc = final_swpout - initial_swpout;
> >         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
> >
> >         fallback_percentage = (double)swpout_fallback_inc /
> >                               (swpout_fallback_inc + swpout_inc) * 100;
> >
> >         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
> >                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
> >     }
> >
> >     free(mem1);
> >     if (mem2 != NULL) {
> >         free(mem2);
> >     }
> >
> >     return EXIT_SUCCESS;
> > }
>
> Thank you very much for your effort to write this test program.
>
> TBH, personally, I think this test program isn't practical
> enough. Can we show a performance difference with some normal workloads?

Right. The whole purpose of this small program is to demonstrate the problem
in the current code: even when swap slots are always allocated and released
aligned with the mTHP size, the current mainline soon reaches a 100% fallback
ratio, although the swap space is sufficient and the swap slots are not
fragmented at all. As long as we lose empty clusters, we lose the chance to
do mTHP swapout.

We are still running tests on real Android phones with real workloads, and
will update you with the results, hopefully this week. I am a little worried
that the triggered WARN_ONCE will lead to the failure of the test.

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
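
For reference, a minimal sketch of how the test above could be built and run. The
file name mthp_swap_test.c and the gcc invocation are illustrative assumptions,
not something stated in the thread; the -s flag and the sysfs counter paths come
from the program itself:

  # Hypothetical build step; any C compiler should do.
  gcc -O2 -o mthp_swap_test mthp_swap_test.c

  # Configure zRAM and 64KB mTHP with the script quoted above, then:
  ./mthp_swap_test        # 64KB mTHP allocations only (effect 1)
  ./mthp_swap_test -s     # also dirty small folios to fragment clusters (effect 2)

  # The per-iteration fallback percentage is computed from these counters:
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback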