Date: Mon, 27 Feb 2023 12:06:14 +0100
From: Jan Kara <jack@suse.cz>
To: Hugh Dickins
Cc: Huang Ying, Andrew Morton, Jan Kara, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador, Matthew Wilcox,
	Bharata B Rao, Alistair Popple, Xin Hao, Minchan Kim,
	Mike Kravetz, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing
Message-ID: <20230227110614.dngdub2j3exr6dfp@quack3>
References: <20230213123444.155149-1-ying.huang@intel.com>
	<87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com>
In-Reply-To: <87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com>
On Fri 17-02-23 13:47:48, Hugh Dickins wrote:
> On Mon, 13 Feb 2023, Huang Ying wrote:
> 
> > From: "Huang, Ying"
> >
> > Currently, migrate_pages() migrates folios one by one, as in the
> > following pseudo-code:
> >
> >   for each folio
> >     unmap
> >     flush TLB
> >     copy
> >     restore map
> >
> > If multiple folios are passed to migrate_pages(), there are
> > opportunities to batch the TLB flushing and copying. That is, we can
> > change the code to something like:
> >
> >   for each folio
> >     unmap
> >   for each folio
> >     flush TLB
> >   for each folio
> >     copy
> >   for each folio
> >     restore map
> >
> > The total number of TLB flushing IPIs can be reduced considerably,
> > and we may use a hardware accelerator such as DSA to accelerate the
> > folio copying.
> >
> > So in this patchset, we refactor the migrate_pages() implementation
> > and implement batched TLB flushing. Based on this, hardware-accelerated
> > folio copying can be implemented.
> >
> > If too many folios are passed to migrate_pages(), the naive batched
> > implementation may unmap too many folios at the same time. This
> > increases the chance that a task has to wait for the migrated folios
> > to be mapped again, which may hurt latency. To deal with this issue,
> > the maximum number of folios unmapped in a batch is restricted to no
> > more than HPAGE_PMD_NR, in units of pages. That is, the impact is at
> > the same level as THP migration.
> >
> > We use the following test to measure the performance impact of the
> > patchset:
> >
> > On a 2-socket Intel server,
> >
> > - Run the pmbench memory-accessing benchmark
> >
> > - Run `migratepages` to migrate pages of pmbench between node 0 and
> >   node 1 back and forth.
> >
> > With the patchset, TLB flushing IPIs are reduced by 99.1% during the
> > test, and the number of pages migrated successfully per second
> > increases by 291.7%.
> >
> > Xin Hao helped to test the patchset on an ARM64 server with 128 cores
> > and 2 NUMA nodes. Test results show that the page migration
> > performance increases by up to 78%.
> >
> > This patchset is based on mm-unstable 2023-02-10.
> 
> And back in linux-next this week: I tried next-20230217 overnight.
> 
> There is a deadlock in this patchset (and in previous versions: sorry
> it's taken me so long to report), but I think one that's easily solved.
> 
> I've not bisected to precisely which patch (load can take several hours
> to hit the deadlock), but it doesn't really matter, and I expect that
> you can guess.
> 
> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> So, plenty of ext4 page cache and buffer_heads.
> 
> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> either in kcompactd0 or in khugepaged trying to compact, or in both:
> it ends up calling __lock_buffer(), and that schedules away, waiting
> forever to get BH_lock. I have not identified who is holding BH_lock,
> but I imagine a jbd2 journalling thread, and presume that it wants one
> of the folio locks which migrate_pages_batch() is already holding; or
> maybe it's all more convoluted than that. Other tasks then back up
> waiting on those folio locks held in the batch.
> 
> Never a problem with buffer_migrate_folio(), always with the "more
> careful" buffer_migrate_folio_norefs(). And the patch below fixes
> it for me: I've had enough hours with it now, on enough occasions,
> to be confident of that.
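(As a reference for the discussion below: the batched flow described in
the quoted cover letter, modeled as a minimal userspace sketch. Every
name, stub, and constant here is illustrative, not the kernel
implementation.)

  /* Minimal model of one-by-one vs. batched migration; stubs only. */
  #include <stdio.h>

  #define HPAGE_PMD_NR 512  /* pages per PMD-sized huge page (x86-64, 4kB pages) */

  struct folio { int id; int nr_pages; };

  static void unmap_folio(struct folio *f)  { printf("unmap  folio %d\n", f->id); }
  static void flush_tlb(void)               { printf("TLB flush (IPI)\n"); }
  static void copy_folio(struct folio *f)   { printf("copy   folio %d\n", f->id); }
  static void restore_map(struct folio *f)  { printf("remap  folio %d\n", f->id); }

  /* Old flow: one TLB flush per folio. */
  static void migrate_one_by_one(struct folio *f, int n)
  {
      for (int i = 0; i < n; i++) {
          unmap_folio(&f[i]);
          flush_tlb();
          copy_folio(&f[i]);
          restore_map(&f[i]);
      }
  }

  /* Batched flow: unmap a capped batch, flush once, then copy and remap.
   * The cap keeps folios from staying unmapped longer than a THP
   * migration would cause. */
  static void migrate_batched(struct folio *f, int n)
  {
      for (int start = 0; start < n; ) {
          int pages = 0, end = start;

          while (end < n && pages + f[end].nr_pages <= HPAGE_PMD_NR)
              pages += f[end++].nr_pages;
          if (end == start)    /* a single folio larger than the cap */
              end++;

          for (int i = start; i < end; i++)
              unmap_folio(&f[i]);
          flush_tlb();         /* one flush for the whole batch */
          for (int i = start; i < end; i++)
              copy_folio(&f[i]);
          for (int i = start; i < end; i++)
              restore_map(&f[i]);
          start = end;
      }
  }

  int main(void)
  {
      struct folio f[] = { {0, 1}, {1, 1}, {2, 512}, {3, 1} };
      migrate_one_by_one(f, 4);
      migrate_batched(f, 4);
      return 0;
  }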
> 
> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> very well, and I hope can assure us that there is an understandable
> deadlock here, from holding several random folio locks, then trying
> to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> folk think something is safe, when it's not sufficient to cope with
> the diversity of filesystems. I hope nothing more than the below is
> needed (and I've had no other problems with the patchset: good job),
> but cannot be sure.

I suspect it can indeed be caused by the presence of the loop device,
as Huang Ying has suggested. What filesystems using buffer_heads do is
a pattern like:

  bh = page_buffers(loop device page cache page);
  lock_buffer(bh);
  submit_bh(bh);
    - now on the loop device this ends up doing:
      lo_write_bvec()
        vfs_iter_write()
          ...
            folio_lock(backing file folio);

So if the migration code holds the "backing file folio" lock and at the
same time waits for the 'bh' lock (while trying to migrate the loop
device's page cache page), it is a deadlock. The proposed solution of
never waiting for locks in batched mode looks like a sensible one to me...

								Honza
-- 
Jan Kara
SUSE Labs, CR
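(To make the inversion concrete: a minimal userspace model of the two
paths above, with pthread mutexes standing in for the backing-file
folio lock and the buffer lock. All names are illustrative and this is
a sketch under those assumptions, not kernel code. With a blocking lock
in the migration thread it deadlocks; with trylock, i.e. never waiting
in batched mode, the cycle is broken.)

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Stand-ins for the two locks in the inversion above. */
  static pthread_mutex_t folio_mtx = PTHREAD_MUTEX_INITIALIZER; /* backing file folio lock */
  static pthread_mutex_t bh_mtx    = PTHREAD_MUTEX_INITIALIZER; /* buffer lock (BH_lock) */

  /* Loop-device write path: lock_buffer(bh), then the submit ends up
   * taking the backing file folio lock via vfs_iter_write(). */
  static void *loop_writeback(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&bh_mtx);      /* lock_buffer(bh) */
      usleep(100 * 1000);               /* widen the race window */
      pthread_mutex_lock(&folio_mtx);   /* folio_lock(backing file folio) */
      puts("writeback: bh written to backing file");
      pthread_mutex_unlock(&folio_mtx);
      pthread_mutex_unlock(&bh_mtx);
      return NULL;
  }

  /* Batched migration: already holds the backing file folio lock, now
   * wants the buffer lock of the loop device's page cache page.  A
   * blocking lock here reproduces the reported deadlock; trylock plus
   * giving up on that folio breaks the cycle. */
  static void *migrate_batch(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&folio_mtx);   /* folio lock held by the batch */
      usleep(100 * 1000);
      if (pthread_mutex_trylock(&bh_mtx) == 0) {
          puts("migration: got buffer lock, migrate the page");
          pthread_mutex_unlock(&bh_mtx);
      } else {
          puts("migration: buffer busy, skip folio instead of waiting");
      }
      pthread_mutex_unlock(&folio_mtx);
      return NULL;
  }

  int main(void)
  {
      pthread_t wb, mig;
      pthread_create(&wb, NULL, loop_writeback, NULL);
      pthread_create(&mig, NULL, migrate_batch, NULL);
      pthread_join(wb, NULL);
      pthread_join(mig, NULL);
      return 0;
  }

Built with `cc deadlock.c -pthread`, both threads finish; replacing the
trylock with pthread_mutex_lock() hangs the pair forever, which is the
ABBA cycle described above.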