From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, chrisl@kernel.org, baohua@kernel.org,
 kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, ryan.roberts@arm.com
Subject: Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: <20240615084714.37499-1-21cnbao@gmail.com> (Barry Song's message of "Sat, 15 Jun 2024 20:47:14 +1200")
References: <20240614195921.a20f1766a78b27339a2a3128@linux-foundation.org>
 <20240615084714.37499-1-21cnbao@gmail.com>
Date: Mon, 17 Jun 2024 14:48:26 +0800
Message-ID: <87bk405akl.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi, Barry,

Barry Song <21cnbao@gmail.com> writes:

> On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton wrote:
>>
>> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li wrote:
>>
>> > > I'm having trouble understanding the overall impact of this on users.
>> > > We fail the mTHP swap allocation and fall back, but things continue to
>> > > operate OK?
>> >
>> > Continue to operate OK in the sense that the mTHP will have to split
>> > into 4K pages before the swap out, aka the fall back.
>> > The swap out and
>> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
>> > fallback, the mTHP based zsmalloc compression with a 64K buffer will not
>> > happen. That is the effect of the fallback. But mTHP swap out and swap
>> > in are relatively new, so it is not really a regression.
>>
>> Sure, but it's pretty bad to merge a new feature only to have it
>> ineffective after a few hours' use.
>>
>> > >
>> > > > There is some test number in the V1 thread of this series:
>> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>> > >
>> > > Well, please let's get the latest numbers into the latest patchset.
>> > > Along with a higher-level (and quantitative) description of the user impact.
>> >
>> > I will need Barry's help to collect the numbers. I don't have the
>> > setup to reproduce his test results.
>> > Maybe a follow-up commit message amendment for the test numbers when I get them?
>
> Although the issue may seem complex at a systemic level, even a small program can
> demonstrate the problem and highlight how Chris's patch has improved the
> situation.
>
> To demonstrate this, I designed a basic test program that maximally allocates
> two memory blocks:
>
> * A memory block of up to 60MB, recommended for HUGEPAGE usage
> * A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>
> In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more
> than enough space for both the 60MB and 1MB allocations in the worst case. This
> setup allows us to assess two effects:
>
> 1. When we don't enable mem2 (small folios), we consistently allocate and free
>    swap slots aligned to 64KB. Is there a risk of failing to obtain swap slots
>    even though the zRAM has sufficient free space?
> 2. When we enable mem2 (small folios), the presence of small folios may lead
>    to fragmentation of clusters, potentially impacting the swapout process for
>    large folios negatively.
>
> (2) can be enabled with "-s"; without -s, small folios are disabled.
>
> The script to configure zRAM and mTHP:
>
> echo lzo > /sys/block/zram0/comp_algorithm
> echo 64M > /sys/block/zram0/disksize
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> mkswap /dev/zram0
> swapon /dev/zram0
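>
> (For reference: the per-size swpout statistics that the test program below
> reads are exported in sysfs, e.g. for the 64kB size:
>
>     cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>     cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback
>
> swpout counts 64kB folios swapped out in one piece; swpout_fallback counts
> 64kB folios that had to be split into 4K pages because contiguous swap slots
> could not be allocated.)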
>
> The test program I made today after receiving Chris' patchset v2 is below
> (Andrew, please let me know if you want this small test program committed
> into the kernel's tools/ folder; if so, I will clean it up and prepare a
> patch):
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <errno.h>
> #include <time.h>
> #include <sys/mman.h>
>
> #define MEMSIZE_MTHP (60 * 1024 * 1024)
> #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> #define ALIGNMENT_MTHP (64 * 1024)
> #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> #define MTHP_FOLIO_SIZE (64 * 1024)
>
> #define SWPOUT_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> #define SWPOUT_FALLBACK_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
>
> static void *aligned_alloc_mem(size_t size, size_t alignment)
> {
>     void *mem = NULL;
>     if (posix_memalign(&mem, alignment, size) != 0) {
>         perror("posix_memalign");
>         return NULL;
>     }
>     return mem;
> }
>
> static void random_madvise_dontneed(void *mem, size_t mem_size,
>             size_t align_size, size_t total_dontneed_size)
> {
>     size_t num_pages = total_dontneed_size / align_size;
>     size_t i;
>     size_t offset;
>     void *addr;
>
>     for (i = 0; i < num_pages; ++i) {
>         offset = (rand() % (mem_size / align_size)) * align_size;
>         addr = (char *)mem + offset;
>         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
>             perror("madvise dontneed");
>         }
>         memset(addr, 0x11, align_size);
>     }
> }
>
> static unsigned long read_stat(const char *path)
> {
>     FILE *file;
>     unsigned long value;
>
>     file = fopen(path, "r");
>     if (!file) {
>         perror("fopen");
>         return 0;
>     }
>
>     if (fscanf(file, "%lu", &value) != 1) {
>         perror("fscanf");
>         fclose(file);
>         return 0;
>     }
>
>     fclose(file);
>     return value;
> }
>
> int main(int argc, char *argv[])
> {
>     int use_small_folio = 0;
>     int i;
>     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
>     if (mem1 == NULL) {
>         fprintf(stderr, "Failed to allocate 60MB memory\n");
>         return EXIT_FAILURE;
>     }
>
>     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
>         perror("madvise hugepage for mem1");
>         free(mem1);
>         return EXIT_FAILURE;
>     }
>
>     for (i = 1; i < argc; ++i) {
>         if (strcmp(argv[i], "-s") == 0) {
>             use_small_folio = 1;
>         }
>     }
>
>     void *mem2 = NULL;
>     if (use_small_folio) {
>         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
>         if (mem2 == NULL) {
>             fprintf(stderr, "Failed to allocate 1MB memory\n");
>             free(mem1);
>             return EXIT_FAILURE;
>         }
>
>         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
>             perror("madvise nohugepage for mem2");
>             free(mem1);
>             free(mem2);
>             return EXIT_FAILURE;
>         }
>     }
>
>     for (i = 0; i < 100; ++i) {
>         unsigned long initial_swpout;
>         unsigned long initial_swpout_fallback;
>         unsigned long final_swpout;
>         unsigned long final_swpout_fallback;
>         unsigned long swpout_inc;
>         unsigned long swpout_fallback_inc;
>         double fallback_percentage;
>
>         initial_swpout = read_stat(SWPOUT_PATH);
>         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
>                     TOTAL_DONTNEED_MTHP);
>
>         if (use_small_folio) {
>             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
>                         ALIGNMENT_SMALLFOLIO,
>                         TOTAL_DONTNEED_SMALLFOLIO);
>         }
>
>         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
>             perror("madvise pageout for mem1");
>             free(mem1);
>             if (mem2 != NULL) {
>                 free(mem2);
>             }
>             return EXIT_FAILURE;
>         }
>
>         if (use_small_folio) {
>             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
>                 perror("madvise pageout for mem2");
>                 free(mem1);
>                 free(mem2);
>                 return EXIT_FAILURE;
>             }
>         }
>
>         final_swpout = read_stat(SWPOUT_PATH);
>         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         swpout_inc = final_swpout - initial_swpout;
>         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
>
>         fallback_percentage = (double)swpout_fallback_inc /
>             (swpout_fallback_inc + swpout_inc) * 100;
>
>         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
>                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
>     }
>
>     free(mem1);
>     if (mem2 != NULL) {
>         free(mem2);
>     }
>
>     return EXIT_SUCCESS;
> }

Thank you very much for your effort to write this test program. TBH,
personally, I think that this test program isn't practical enough. Can we
show the performance difference with some normal workloads?

[snip]

--
Best Regards,
Huang, Ying
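>
> A minimal build-and-run sketch (the file name mthp_swap_test.c is only
> illustrative, it is not part of the patchset):
>
>     gcc -O2 -o mthp_swap_test mthp_swap_test.c
>     ./mthp_swap_test        # 64KB mTHP allocations only
>     ./mthp_swap_test -s     # additionally mix in 4KB small folios
>
> Each of the 100 iterations prints the increase of the 64kB swpout and
> swpout_fallback counters and the resulting fallback percentage.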