From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Byungchul Park <byungchul@sk.com>
Cc: <linux-kernel@vger.kernel.org>,  <linux-mm@kvack.org>,
  <kernel_team@skhynix.com>,  <akpm@linux-foundation.org>,
  <vernhao@tencent.com>,  <mgorman@techsingularity.net>,
  <hughd@google.com>,  <willy@infradead.org>,  <david@redhat.com>,
  <peterz@infradead.org>,  <luto@kernel.org>,  <tglx@linutronix.de>,
  <mingo@redhat.com>,  <bp@alien8.de>,  <dave.hansen@linux.intel.com>,
  <rjgolo@gmail.com>
Subject: Re: [PATCH v10 00/12] LUF(Lazy Unmap Flush) reducing tlb numbers
 over 90%
In-Reply-To: <20240510065206.76078-1-byungchul@sk.com> (Byungchul Park's
	message of "Fri, 10 May 2024 15:51:54 +0900")
References: <20240510065206.76078-1-byungchul@sk.com>
Date: Sat, 11 May 2024 14:54:51 +0800
Message-ID: <87ikzkg7no.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Byungchul Park <byungchul@sk.com> writes:

> Hi everyone,
>
> While working with a tiered memory system, e.g. CXL memory, I have
> been facing migration overhead, especially tlb shootdowns on promotion
> or demotion between different tiers.  Most tlb shootdowns on migration
> through hinting faults can be avoided thanks to Huang Ying's work,
> commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE is
> inaccessible").  See the following link for more information:
>
> https://lore.kernel.org/lkml/20231115025755.GA29979@system.software.com/
>
> However, that only covers migration through hinting faults.  I thought
> it would be much better to have a general mechanism for reducing tlb
> flushes, one that can be applied to any unmap code that we normally
> believe must be followed by a tlb flush.
>
> I'm suggesting a new mechanism, LUF (Lazy Unmap Flush), that defers the
> tlb flush until folios that have been unmapped and freed eventually get
> allocated again.  This is safe for folios that had been mapped
> read-only and were unmapped, since the contents of the folios don't
> change while they stay in pcp or buddy, so the data can still be read
> correctly through the stale tlb entries.
>
> The tlb flush can be deferred when folios get unmapped as long as the
> required flush is guaranteed to happen before the folios actually
> become used again and, of course, only if none of the corresponding
> ptes have write permission.  Otherwise, the system will get messed up.
>
> To achieve that:
>
>    1. For folios that are mapped only through non-writable tlb entries,
>       skip the tlb flush during unmapping but perform it just before
>       the folios actually become used, out of buddy or pcp.
>
>    2. When any non-writable pte changes to writable, e.g. through the
>       fault handler, give up on the luf mechanism and perform the
>       required tlb flush right away.
>
>    3. When a writable mapping is created, e.g. through mmap(), give up
>       on the luf mechanism and perform the required tlb flush right
>       away.
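
To make the three rules above concrete, here is a toy model of the
deferral logic.  It is only a sketch under loud assumptions: the names
Folio, ToyLUF, pending_gen and flushed_gen are hypothetical and do not
come from the patches, a single counter stands in for the "unmap
generation number" mentioned in the patch titles, and one flush is
assumed to cover every deferral issued so far.  It does not model
CPUs, page tables or the real pcp/buddy allocator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folio:
    name: str
    writable: bool                     # was it mapped by any writable pte?
    pending_gen: Optional[int] = None  # generation of its deferred flush


class ToyLUF:
    """Hypothetical toy model; counts flushes, nothing more."""

    def __init__(self):
        self.next_gen = 1      # generation given to the next deferred unmap
        self.flushed_gen = 0   # all generations <= this are already flushed
        self.flush_count = 0
        self.free_list = []

    def flush(self):
        # Assumption: one flush covers every deferral issued so far.
        self.flush_count += 1
        self.flushed_gen = self.next_gen - 1

    def unmap_and_free(self, folio):
        if folio.writable:
            # Stale writable entries are unsafe, so flush right away.
            self.flush()
        else:
            # Rule 1: read-only folio, record a generation and defer.
            folio.pending_gen = self.next_gen
            self.next_gen += 1
        self.free_list.append(folio)

    def on_writable_mapping(self):
        # Rules 2 and 3: give up deferral if any flush is still pending.
        if self.flushed_gen < self.next_gen - 1:
            self.flush()

    def allocate(self):
        folio = self.free_list.pop(0)
        # Rule 1, second half: flush just before the folio is reused.
        if folio.pending_gen is not None and folio.pending_gen > self.flushed_gen:
            self.flush()
        folio.pending_gen = None
        return folio


if __name__ == "__main__":
    luf = ToyLUF()
    for i in range(8):
        luf.unmap_and_free(Folio(f"ro{i}", writable=False))
    print("flushes after 8 read-only unmaps:", luf.flush_count)
    luf.allocate()
    luf.allocate()
    print("flushes after 2 allocations:", luf.flush_count)
```

In this model the eight read-only unmaps cost no flush at all, and the
first allocation pays for all of them at once.  Whether the real
allocation path can afford that check is a separate question.
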
>
> No matter what type of workload is used for performance evaluation, the
> result would be positive thanks to the unconditional reduction of tlb
> flushes, tlb misses and interrupts.

Are there any downsides to the optimization?  Will it cause a regression
for workloads with almost no read-only mappings?  Will it cause a
regression in page allocation?

> For the test, I picked one of the most popular and heaviest workloads,
> llama.cpp, which is an LLM (Large Language Model) inference engine.

IIUC, llama.cpp is a workload with a huge read-only mapping.

> The result would depend on memory latency and how often reclaim runs,
> which implies tlb miss overhead and how many times unmapping happens.
> In my system, the result shows:
>
>    1. tlb flushes are reduced by about 95%.
>    2. tlb misses(itlb) are reduced by about 80%.
>    3. tlb misses(dtlb store) are reduced by about 57%.
>    4. tlb misses(dtlb load) are reduced by about 24%.
>    5. tlb shootdown interrupts are reduced by about 95%.
>    6. The test program runtime is reduced by about 5%.
>
> The test environment and the results are as follows:
>
>    Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
>    CPU: 1 socket 64 core with hyper thread on
>    Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
>    Config: swap off, numa balancing tiering on, demotion enabled
>
>    The test set:
>
>       llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
>       llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
>       llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
>       wait
>
>       where -t: nr of threads, -s: seed used to make the runtime stable,
>       -n: nr of tokens that determines the runtime, -p: prompt to ask,
>       -m: LLM model to use.
>
>    Run the test set 10 times in succession, dropping caches before
>    every run via 'echo 3 > /proc/sys/vm/drop_caches'.  Each inference
>    prints its runtime when it finishes.
>
>    1. Runtime from the output of llama.cpp:
>
>    BEFORE
>    ------
>    llama_print_timings:       total time = 1002461.95 ms /    24 tokens
>    llama_print_timings:       total time = 1044978.38 ms /    24 tokens
>    llama_print_timings:       total time = 1000653.09 ms /    24 tokens
>    llama_print_timings:       total time = 1047104.80 ms /    24 tokens
>    llama_print_timings:       total time = 1069430.36 ms /    24 tokens
>    llama_print_timings:       total time = 1068201.16 ms /    24 tokens
>    llama_print_timings:       total time = 1078092.59 ms /    24 tokens
>    llama_print_timings:       total time = 1073200.45 ms /    24 tokens
>    llama_print_timings:       total time = 1067136.00 ms /    24 tokens
>    llama_print_timings:       total time = 1076442.56 ms /    24 tokens
>    llama_print_timings:       total time = 1004142.64 ms /    24 tokens
>    llama_print_timings:       total time = 1042942.65 ms /    24 tokens
>    llama_print_timings:       total time =  999933.76 ms /    24 tokens
>    llama_print_timings:       total time = 1046548.83 ms /    24 tokens
>    llama_print_timings:       total time = 1068671.48 ms /    24 tokens
>    llama_print_timings:       total time = 1068285.76 ms /    24 tokens
>    llama_print_timings:       total time = 1077789.63 ms /    24 tokens
>    llama_print_timings:       total time = 1071558.93 ms /    24 tokens
>    llama_print_timings:       total time = 1066181.55 ms /    24 tokens
>    llama_print_timings:       total time = 1076767.53 ms /    24 tokens
>    llama_print_timings:       total time = 1004065.63 ms /    24 tokens
>    llama_print_timings:       total time = 1044522.13 ms /    24 tokens
>    llama_print_timings:       total time =  999725.33 ms /    24 tokens
>    llama_print_timings:       total time = 1047510.77 ms /    24 tokens
>    llama_print_timings:       total time = 1068010.27 ms /    24 tokens
>    llama_print_timings:       total time = 1068999.31 ms /    24 tokens
>    llama_print_timings:       total time = 1077648.05 ms /    24 tokens
>    llama_print_timings:       total time = 1071378.96 ms /    24 tokens
>    llama_print_timings:       total time = 1066326.32 ms /    24 tokens
>    llama_print_timings:       total time = 1077088.92 ms /    24 tokens
>
>    AFTER
>    -----
>    llama_print_timings:       total time =  988522.03 ms /    24 tokens
>    llama_print_timings:       total time =  997204.52 ms /    24 tokens
>    llama_print_timings:       total time =  996605.86 ms /    24 tokens
>    llama_print_timings:       total time =  991985.50 ms /    24 tokens
>    llama_print_timings:       total time = 1035143.31 ms /    24 tokens
>    llama_print_timings:       total time =  993660.18 ms /    24 tokens
>    llama_print_timings:       total time =  983082.14 ms /    24 tokens
>    llama_print_timings:       total time =  990431.36 ms /    24 tokens
>    llama_print_timings:       total time =  992707.09 ms /    24 tokens
>    llama_print_timings:       total time =  992673.27 ms /    24 tokens
>    llama_print_timings:       total time =  989285.43 ms /    24 tokens
>    llama_print_timings:       total time =  996710.06 ms /    24 tokens
>    llama_print_timings:       total time =  996534.64 ms /    24 tokens
>    llama_print_timings:       total time =  991344.17 ms /    24 tokens
>    llama_print_timings:       total time = 1035210.84 ms /    24 tokens
>    llama_print_timings:       total time =  994714.13 ms /    24 tokens
>    llama_print_timings:       total time =  984184.15 ms /    24 tokens
>    llama_print_timings:       total time =  990909.45 ms /    24 tokens
>    llama_print_timings:       total time =  991881.48 ms /    24 tokens
>    llama_print_timings:       total time =  993918.03 ms /    24 tokens
>    llama_print_timings:       total time =  990061.34 ms /    24 tokens
>    llama_print_timings:       total time =  998076.69 ms /    24 tokens
>    llama_print_timings:       total time =  997082.59 ms /    24 tokens
>    llama_print_timings:       total time =  990677.58 ms /    24 tokens
>    llama_print_timings:       total time = 1036054.94 ms /    24 tokens
>    llama_print_timings:       total time =  994125.93 ms /    24 tokens
>    llama_print_timings:       total time =  982467.01 ms /    24 tokens
>    llama_print_timings:       total time =  990191.60 ms /    24 tokens
>    llama_print_timings:       total time =  993319.24 ms /    24 tokens
>    llama_print_timings:       total time =  992540.57 ms /    24 tokens
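
The "about 5%" runtime figure can be cross-checked from the quoted
timings.  The snippet below is only a sketch: the two lists are
transcribed from the llama.cpp output above, and the names BEFORE_MS,
AFTER_MS and runtime_reduction_pct are made up for illustration.  It
compares plain means and says nothing about run-to-run variance.

```python
from statistics import mean

# "total time" values in ms, transcribed from the quoted BEFORE output.
BEFORE_MS = [
    1002461.95, 1044978.38, 1000653.09, 1047104.80, 1069430.36,
    1068201.16, 1078092.59, 1073200.45, 1067136.00, 1076442.56,
    1004142.64, 1042942.65, 999933.76, 1046548.83, 1068671.48,
    1068285.76, 1077789.63, 1071558.93, 1066181.55, 1076767.53,
    1004065.63, 1044522.13, 999725.33, 1047510.77, 1068010.27,
    1068999.31, 1077648.05, 1071378.96, 1066326.32, 1077088.92,
]

# "total time" values in ms, transcribed from the quoted AFTER output.
AFTER_MS = [
    988522.03, 997204.52, 996605.86, 991985.50, 1035143.31,
    993660.18, 983082.14, 990431.36, 992707.09, 992673.27,
    989285.43, 996710.06, 996534.64, 991344.17, 1035210.84,
    994714.13, 984184.15, 990909.45, 991881.48, 993918.03,
    990061.34, 998076.69, 997082.59, 990677.58, 1036054.94,
    994125.93, 982467.01, 990191.60, 993319.24, 992540.57,
]


def runtime_reduction_pct(before, after):
    """Drop in mean runtime, as a percentage of the 'before' mean."""
    b, a = mean(before), mean(after)
    return 100.0 * (b - a) / b


if __name__ == "__main__":
    print(f"mean before: {mean(BEFORE_MS) / 1000:.1f} s")
    print(f"mean after:  {mean(AFTER_MS) / 1000:.1f} s")
    print(f"reduction:   {runtime_reduction_pct(BEFORE_MS, AFTER_MS):.1f} %")
```
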
>
>    2. tlb shootdowns from 'cat /proc/interrupts':
>
>    BEFORE
>    ------
>    TLB:
>    125553646  141418810  161932620  176853972  186655697  190399283
>    192143823  196414038  192872439  193313658  193395617  192521416
>    190788161  195067598  198016061  193607347  194293972  190786732
>    191545637  194856822  191801931  189634535  190399803  196365922
>    195268398  190115840  188050050  193194908  195317617  190820190
>    190164820  185556071  226797214  229592631  216112464  209909495
>    205575979  205950252  204948111  197999795  198892232  205287952
>    199344631  195015158  195869844  198858745  195692876  200961904
>    203463252  205921722  199850838  206145986  199613202  199961345
>    200129577  203020521  207873649  203697671  197093386  204243803
>    205993323  200934664  204193128  194435376  TLB shootdowns
>
>    AFTER
>    -----
>    TLB:
>      5648092    6610142    7032849    7882308    8088518    8352310
>      8656536    8705136    8647426    8905583    8985408    8704522
>      8884344    9026261    8929974    8869066    8877575    8810096
>      8770984    8754503    8801694    8865925    8787524    8656432
>      8755912    8682034    8773935    8832925    8797997    8515777
>      8481240    8891258   10595243   10285973    9756935    9573681
>      9398968    9069244    9242984    8899009    9310690    9029095
>      9069758    9105825    9092703    9270202    9460287    9258546
>      9180415    9232723    9270611    9175020    9490420    9360316
>      9420818    9057663    9525631    9310152    9152242    8654483
>      9181804    9050847    8919916    8883856  TLB shootdowns
>
>    3. tlb numbers from 'perf stat' per test set:
>
>    BEFORE
>    ------
>    3163679332	dTLB-load-misses
>    2017751856	dTLB-store-misses
>    327092903	iTLB-load-misses
>    1357543886	tlb:tlb_flush
>
>    AFTER
>    -----
>    2394694609	dTLB-load-misses
>    861144167	dTLB-store-misses
>    64055579	iTLB-load-misses
>    69175002	tlb:tlb_flush
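
The percentages in the summary list can be recomputed from the quoted
'perf stat' counters.  This is a minimal sketch: the counter values are
copied from the tables above, while the dictionary and function names
are hypothetical.

```python
# Counter values per test set, copied from the quoted 'perf stat' output.
BEFORE = {
    "dTLB-load-misses": 3163679332,
    "dTLB-store-misses": 2017751856,
    "iTLB-load-misses": 327092903,
    "tlb:tlb_flush": 1357543886,
}
AFTER = {
    "dTLB-load-misses": 2394694609,
    "dTLB-store-misses": 861144167,
    "iTLB-load-misses": 64055579,
    "tlb:tlb_flush": 69175002,
}


def reduction_pct(before, after):
    """Relative reduction as a percentage of the 'before' value."""
    return 100.0 * (before - after) / before


def all_reductions():
    """Map each counter name to its percentage reduction."""
    return {name: reduction_pct(BEFORE[name], AFTER[name]) for name in BEFORE}


if __name__ == "__main__":
    for name, pct in all_reductions().items():
        print(f"{name:<20} reduced by {pct:.1f} %")
```

Rounded to whole percent, the four results agree with the 24%, 57%,
80% and 95% figures quoted earlier.
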
>
> ---
>
> Changes from v9:
>
> 	1. Expand the candidate to apply this mechanism:
>
> 	   BEFORE - The source folios at any type of migration.
> 	   AFTER  - Any folios that have been unmapped and freed.
>
> 	2. Change the workload for test:
>
> 	   BEFORE - XSBench
> 	   AFTER  - llama.cpp (one of the most popular real workloads)
>
> 	3. Change the test environment:
>
> 	   BEFORE - qemu machine, too small DRAM(1GB), large remote mem
> 	   AFTER  - bare metal, real CXL memory, practical memory size
>
> 	4. Rename the mechanism from MIGRC(Migration Read Copy) to
> 	   LUF(Lazy Unmap Flush) to reflect that the current version of
> 	   the mechanism can be applied not only to unmap during
> 	   migration but to any unmap code, e.g. unmap in
> 	   shrink_folio_list().
>
> 	5. Fix build error for riscv. (feedback from kernel test bot)
>
> 	6. Supplement commit messages to describe what this mechanism is
> 	   for, especially in the patches for arch code. (feedback from
> 	   Thomas Gleixner)
>
> 	7. Clean up some trivial things.
>
> Changes from v8:
>
> 	1. Rebase on akpm/mm.git mm-unstable as of April 18, 2024.
> 	2. Supplement comments and commit message.
> 	3. Change the candidate to apply migrc mechanism:
>
> 	   BEFORE - The source folios at demotion and promotion.
> 	   AFTER  - The source folios at any type of migration.
>
> 	4. Change how migrc mechanism works:
>
> 	   BEFORE - Reduce tlb flushes by deferring folio_free() for
> 	            source folios during demotion and promotion.
> 	   AFTER  - Reduce tlb flushes by deferring the tlb flush until
> 	            the folios actually become used, out of pcp or
> 	            buddy. The current version of migrc does *not* defer
> 	            calling folio_free(); folios are freed just as in
> 	            the vanilla kernel, but marked as 'needing a tlb
> 	            flush', and the flush is handled when the page
> 	            leaves pcp or buddy. This avoids changing vm stats,
> 	            e.g. free pages.
>
> Changes from v7:
>
> 	1. Rewrite cover letter to explain what the 'migrc' mechanism is.
> 	   (feedback from Andrew Morton)
> 	2. Supplement the commit message of a patch 'mm: Add APIs to
> 	   free a folio directly to the buddy bypassing pcp'.
> 	   (feedback from Andrew Morton)
>
> Changes from v6:
>
> 	1. Fix build errors in case of
> 	   CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH disabled by moving
> 	   migrc_flush_{start,end}() calls from arch code to
> 	   try_to_unmap_flush() in mm/rmap.c.
>
> Changes from v5:
>
> 	1. Fix build errors in case of CONFIG_MIGRATION disabled or
> 	   CONFIG_HWPOISON_INJECT built as a module. (feedback from
> 	   kernel test bot and Raymond Jay Golo)
> 	2. Organize migrc code with two kconfigs, CONFIG_MIGRATION and
> 	   CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>
> Changes from v4:
>
> 	1. Rebase on v6.7.
> 	2. Fix build errors on arm64, which does nothing for tlb flush
> 	   but has CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. (reported
> 	   by kernel test robot)
> 	3. Don't use any page flag. So the system would give up migrc
> 	   mechanism more often but it's okay. The final improvement is
> 	   good enough.
> 	4. Instead, optimize full tlb flush(arch_tlbbatch_flush()) by
> 	   excluding redundant CPUs from the tlb flush.
>
> Changes from v3:
>
> 	1. Don't use the kconfig, CONFIG_MIGRC, and remove sysctl knob,
> 	   migrc_enable. (feedback from Nadav)
> 	2. Remove the optimization skipping CPUs that have already
> 	   performed the needed tlb flushes for any reason when
> 	   performing tlb flushes by migrc, because I can't tell the
> 	   performance difference with and without the optimization.
> 	   (feedback from Nadav)
> 	3. Minimize arch-specific code. While at it, move all the migrc
> 	   declarations and inline functions from include/linux/mm.h to
> 	   mm/internal.h. (feedback from Dave Hansen, Nadav)
> 	4. Separate the part that pauses migrc when the system is under
> 	   high memory pressure into another patch. (feedback from Nadav)
> 	5. Rename:
> 	      a. arch_tlbbatch_clean() to arch_tlbbatch_clear(),
> 	      b. tlb_ubc_nowr to tlb_ubc_ro,
> 	      c. migrc_try_flush_free_folios() to migrc_flush_free_folios(),
> 	      d. migrc_stop to migrc_pause.
> 	   (feedback from Nadav)
> 	6. Use ->lru list_head instead of introducing a new llist_head.
> 	   (feedback from Nadav)
> 	7. Use non-atomic operations of page-flag when it's safe.
> 	   (feedback from Nadav)
> 	8. Use stack instead of keeping a pointer of 'struct migrc_req'
> 	   in struct task, which is for manipulating it locally.
> 	   (feedback from Nadav)
> 	9. Replace a lot of simple functions with inline functions
> 	   placed in a header, mm/internal.h. (feedback from Nadav)
> 	10. Add additional sufficient comments. (feedback from Nadav)
> 	11. Remove a lot of wrapper functions. (feedback from Nadav)
>
> Changes from RFC v2:
>
> 	1. Remove additional occupation in struct page. To do that,
> 	   unioned with lru field for migrc's list and added a page
> 	   flag. I know page flag is a thing that we don't like to add
> 	   but no choice because migrc should distinguish folios under
> 	   migrc's control from others. Instead, I force migrc to be
> 	   used only on 64 bit system to mitigate you guys from getting
> 	   angry.
> 	2. Remove meaningless internal object allocator that I
> 	   introduced to minimize impact onto the system. However, a ton
> 	   of tests showed there was no difference.
> 	3. Stop migrc from working when the system is in high memory
> 	   pressure like about to perform direct reclaim. At the
> 	   condition where the swap mechanism is heavily used, I found
> 	   the system suffered from regression without this control.
> 	4. Exclude folios that pte_dirty() == true from migrc's interest
> 	   so that migrc can work simpler.
> 	5. Combine several patches that work tightly coupled to one.
> 	6. Add sufficient comments for better review.
> 	7. Manage migrc's request in per-node manner (from globally).
> 	8. Add tlb miss improvement in commit message.
> 	9. Test with more CPUs(4 -> 16) to see bigger improvement.
>
> Changes from RFC:
>
> 	1. Fix a bug triggered when a destination folio at the previous
> 	   migration becomes a source folio at the next migration,
> 	   before the folio gets handled properly so that the folio can
> 	   play with another migration. There was inconsistency in the
> 	   folio's state. Fixed it.
> 	2. Split the patch set into more pieces so that the folks can
> 	   review better. (Feedback from Nadav Amit)
> 	3. Fix a wrong usage of barrier e.g. smp_mb__after_atomic().
> 	   (Feedback from Nadav Amit)
> 	4. Tried to add sufficient comments to explain the patch set
> 	   better. (Feedback from Nadav Amit)
>
> Byungchul Park (12):
>   x86/tlb: add APIs manipulating tlb batch's arch data
>   arm64: tlbflush: add APIs manipulating tlb batch's arch data
>   riscv, tlb: add APIs manipulating tlb batch's arch data
>   x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of
>     arch_tlbbatch_flush()
>   mm: buddy: make room for a new variable, ugen, in struct page
>   mm: add folio_put_ugen() to deliver unmap generation number to pcp or
>     buddy
>   mm: add a parameter, unmap generation number, to free_unref_folios()
>   mm/rmap: recognize read-only tlb entries during batched tlb flush
>   mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get
>     unmapped
>   mm: separate move/undo parts from migrate_pages_batch()
>   mm, migrate: apply luf mechanism to unmapping during migration
>   mm, vmscan: apply luf mechanism to unmapping during folio reclaim
>
>  arch/arm64/include/asm/tlbflush.h |  18 ++
>  arch/riscv/include/asm/tlbflush.h |  21 ++
>  arch/riscv/mm/tlbflush.c          |   1 -
>  arch/x86/include/asm/tlbflush.h   |  18 ++
>  arch/x86/mm/tlb.c                 |   2 -
>  include/linux/mm.h                |  22 ++
>  include/linux/mm_types.h          |  40 +++-
>  include/linux/rmap.h              |   7 +-
>  include/linux/sched.h             |  11 +
>  mm/compaction.c                   |  10 +
>  mm/internal.h                     | 115 +++++++++-
>  mm/memory.c                       |   8 +
>  mm/migrate.c                      | 184 ++++++++++------
>  mm/mmap.c                         |   8 +
>  mm/page_alloc.c                   | 157 +++++++++++---
>  mm/page_isolation.c               |   6 +
>  mm/page_reporting.c               |  10 +
>  mm/rmap.c                         | 345 +++++++++++++++++++++++++++++-
>  mm/swap.c                         |  18 +-
>  mm/vmscan.c                       |  29 ++-
>  20 files changed, 904 insertions(+), 126 deletions(-)
>
>
> base-commit: f52bcd4a9f6058704a6f6b6b50418f579defd4fe

--
Best Regards,
Huang, Ying