From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, chrisl@kernel.org, baohua@kernel.org,
 kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, ryan.roberts@arm.com
Subject: Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
In-Reply-To: <20240615084714.37499-1-21cnbao@gmail.com> (Barry Song's message of "Sat, 15 Jun 2024 20:47:14 +1200")
References: <20240614195921.a20f1766a78b27339a2a3128@linux-foundation.org>
 <20240615084714.37499-1-21cnbao@gmail.com>
Date: Mon, 17 Jun 2024 14:48:26 +0800
Message-ID: <87bk405akl.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi, Barry,

Barry Song <21cnbao@gmail.com> writes:

> On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton wrote:
>>
>> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li wrote:
>>
>> > > I'm having trouble understanding the overall impact of this on users.
>> > > We fail the mTHP swap allocation and fall back, but things continue to
>> > > operate OK?
>> >
>> > Continue to operate OK in the sense that the mTHP will have to split
>> > into 4K pages before the swap out, aka the fall back.
>> > The swap out and
>> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
>> > fallback, the mTHP based zsmalloc compression with a 64K buffer will not
>> > happen. That is the effect of the fallback. But mTHP swap out and swap
>> > in are relatively new, so it is not really a regression.
>>
>> Sure, but it's pretty bad to merge a new feature only to have it
>> ineffective after a few hours' use.
>>
>> > >
>> > > > There is some test number in the V1 thread of this series:
>> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>> > >
>> > > Well, please let's get the latest numbers into the latest patchset.
>> > > Along with a higher-level (and quantitative) description of the user impact.
>> >
>> > I will need Barry's help to collect the numbers. I don't have the
>> > setup to reproduce his test results.
>> > Maybe a follow-up commit message amendment for the test numbers when I get them?
>
> Although the issue may seem complex at a systemic level, even a small program can
> demonstrate the problem and highlight how Chris's patch has improved the
> situation.
>
> To demonstrate this, I designed a basic test program that maximally allocates
> two memory blocks:
>
> * A memory block of up to 60MB, recommended for HUGEPAGE usage
> * A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>
> In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more
> than enough space for both the 60MB and 1MB allocations in the worst case. This
> setup allows us to assess two effects:
>
> 1. When we don't enable mem2 (small folios), we consistently allocate and free
>    swap slots aligned to 64KB. Is there a risk of failing to obtain swap slots
>    even though the zRAM has sufficient free space?
> 2. When we enable mem2 (small folios), the presence of small folios may lead
>    to fragmentation of clusters, potentially impacting the swapout process for
>    large folios negatively.
>
> (2) can be enabled with "-s"; without -s, small folios are disabled.
>
> The script to configure zRAM and mTHP:
>
> echo lzo > /sys/block/zram0/comp_algorithm
> echo 64M > /sys/block/zram0/disksize
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> mkswap /dev/zram0
> swapon /dev/zram0
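>
> (For reference: the per-size swpout statistics that the test program below
> reads are exported in sysfs, e.g. for the 64kB size:
>
>     cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>     cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback
>
> swpout counts 64kB folios swapped out in one piece; swpout_fallback counts
> 64kB folios that had to be split into 4K pages because contiguous swap slots
> could not be allocated.)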
>
> The test program I made today after receiving Chris' patchset v2 is below
> (Andrew, please let me know if you want this small test program committed
> into the kernel's tools/ folder; if so, I will clean it up and prepare a
> patch):
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <errno.h>
> #include <time.h>
> #include <sys/mman.h>
>
> #define MEMSIZE_MTHP (60 * 1024 * 1024)
> #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> #define ALIGNMENT_MTHP (64 * 1024)
> #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> #define MTHP_FOLIO_SIZE (64 * 1024)
>
> #define SWPOUT_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> #define SWPOUT_FALLBACK_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
>
> static void *aligned_alloc_mem(size_t size, size_t alignment)
> {
>     void *mem = NULL;
>     if (posix_memalign(&mem, alignment, size) != 0) {
>         perror("posix_memalign");
>         return NULL;
>     }
>     return mem;
> }
>
> static void random_madvise_dontneed(void *mem, size_t mem_size,
>             size_t align_size, size_t total_dontneed_size)
> {
>     size_t num_pages = total_dontneed_size / align_size;
>     size_t i;
>     size_t offset;
>     void *addr;
>
>     for (i = 0; i < num_pages; ++i) {
>         offset = (rand() % (mem_size / align_size)) * align_size;
>         addr = (char *)mem + offset;
>         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
>             perror("madvise dontneed");
>         }
>         memset(addr, 0x11, align_size);
>     }
> }
>
> static unsigned long read_stat(const char *path)
> {
>     FILE *file;
>     unsigned long value;
>
>     file = fopen(path, "r");
>     if (!file) {
>         perror("fopen");
>         return 0;
>     }
>
>     if (fscanf(file, "%lu", &value) != 1) {
>         perror("fscanf");
>         fclose(file);
>         return 0;
>     }
>
>     fclose(file);
>     return value;
> }
>
> int main(int argc, char *argv[])
> {
>     int use_small_folio = 0;
>     int i;
>     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
>     if (mem1 == NULL) {
>         fprintf(stderr, "Failed to allocate 60MB memory\n");
>         return EXIT_FAILURE;
>     }
>
>     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
>         perror("madvise hugepage for mem1");
>         free(mem1);
>         return EXIT_FAILURE;
>     }
>
>     for (i = 1; i < argc; ++i) {
>         if (strcmp(argv[i], "-s") == 0) {
>             use_small_folio = 1;
>         }
>     }
>
>     void *mem2 = NULL;
>     if (use_small_folio) {
>         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
>         if (mem2 == NULL) {
>             fprintf(stderr, "Failed to allocate 1MB memory\n");
>             free(mem1);
>             return EXIT_FAILURE;
>         }
>
>         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
>             perror("madvise nohugepage for mem2");
>             free(mem1);
>             free(mem2);
>             return EXIT_FAILURE;
>         }
>     }
>
>     for (i = 0; i < 100; ++i) {
>         unsigned long initial_swpout;
>         unsigned long initial_swpout_fallback;
>         unsigned long final_swpout;
>         unsigned long final_swpout_fallback;
>         unsigned long swpout_inc;
>         unsigned long swpout_fallback_inc;
>         double fallback_percentage;
>
>         initial_swpout = read_stat(SWPOUT_PATH);
>         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
>                     TOTAL_DONTNEED_MTHP);
>
>         if (use_small_folio) {
>             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
>                         ALIGNMENT_SMALLFOLIO,
>                         TOTAL_DONTNEED_SMALLFOLIO);
>         }
>
>         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
>             perror("madvise pageout for mem1");
>             free(mem1);
>             if (mem2 != NULL) {
>                 free(mem2);
>             }
>             return EXIT_FAILURE;
>         }
>
>         if (use_small_folio) {
>             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
>                 perror("madvise pageout for mem2");
>                 free(mem1);
>                 free(mem2);
>                 return EXIT_FAILURE;
>             }
>         }
>
>         final_swpout = read_stat(SWPOUT_PATH);
>         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         swpout_inc = final_swpout - initial_swpout;
>         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
>
>         fallback_percentage = (double)swpout_fallback_inc /
>             (swpout_fallback_inc + swpout_inc) * 100;
>
>         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
>                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
>     }
>
>     free(mem1);
>     if (mem2 != NULL) {
>         free(mem2);
>     }
>
>     return EXIT_SUCCESS;
> }

Thank you very much for your effort to write this test program. TBH,
personally, I think that this test program isn't practical enough. Can we
show the performance difference with some normal workloads?

[snip]

--
Best Regards,
Huang, Ying
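>
> A minimal build-and-run sketch (the file name mthp_swap_test.c is only
> illustrative, it is not part of the patchset):
>
>     gcc -O2 -o mthp_swap_test mthp_swap_test.c
>     ./mthp_swap_test        # 64KB mTHP allocations only
>     ./mthp_swap_test -s     # additionally mix in 4KB small folios
>
> Each of the 100 iterations prints the increase of the 64kB swpout and
> swpout_fallback counters and the resulting fallback percentage.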