Date: Thu, 20 Oct 2022 09:04:24 +1100
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox
Cc: Zhaoyang Huang, "zhaoyang.huang", Andrew Morton,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, ke.wang@unisoc.com,
	steve.kang@unisoc.com, baocong.liu@unisoc.com, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: move xa forward when run across zombie page
Message-ID: <20221019220424.GO2703033@dread.disaster.area>
References: <1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com>
	<20221018223042.GJ2703033@dread.disaster.area>

On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > This is reading and writing the same amount of file data at the
> > application level, but once the data has been written and kicked out
> > of the page cache it seems to require an awful lot more read IO to
> > get it back to the application. i.e.
> > this looks like mmap() is readahead thrashing severely, and
> > eventually it livelocks with this sort of report:
> >
> > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [175901.985095] rcu: 	Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > [175901.987996] 	(detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > [175901.995614] Call Trace:
> > [175901.996090]  <TASK>
> > [175901.996594]  ? __schedule+0x301/0xa30
> > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [175902.000714]  ? xas_start+0x53/0xc0
> > [175902.001484]  ? xas_load+0x24/0xa0
> > [175902.002208]  ? xas_load+0x5/0xa0
> > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > [175902.004693]  ? __do_fault+0x31/0x1d0
> > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > [175902.008613]  </TASK>
> >
> > Given that filemap_fault on XFS is probably trying to map large
> > folios, I do wonder if this is a result of some kind of race with
> > teardown of a large folio...
>
> It doesn't matter whether we're trying to map a large folio; it
> matters whether a large folio was previously created in the cache.
> Through the magic of readahead, it may well have been. I suspect
> it's not teardown of a large folio, but splitting. Removing a
> page from the page cache stores to the pointer in the XArray
> first (either NULL or a shadow entry), then decrements the refcount.
>
> We must be observing a frozen folio. There are a number of places
> in the MM which freeze a folio, but the obvious one is splitting.
> That looks like this:
>
>         local_irq_disable();
>         if (mapping) {
>                 xas_lock(&xas);
> (...)
>                 if (folio_ref_freeze(folio, 1 + extra_pins)) {

But the lookup is not doing anything to prevent the split on the
frozen page from making progress, right? It's not holding any folio
references, and it's not holding the mapping tree lock, either. So
how does the lookup in progress prevent the page split from making
progress?

> So one way to solve this might be to try to take the xa_lock on
> failure to get the refcount. Otherwise a high-priority task
> might spin forever without a low-priority task getting the chance
> to finish the work being done while the folio is frozen.

IIUC, then you are saying that there is a scheduling priority
inversion because the lookup failure looping path doesn't yield
the CPU?

If so, how does taking the mapping tree spin lock on failure cause
the looping task to yield the CPU and hence allow the folio split
to make progress?

Also, AFAICT, the page split has disabled local interrupts, so it
should effectively be running with preemption disabled as it has
turned off the mechanism the scheduler can use to preempt it. The
page split can't sleep, either, because it holds the mapping tree
lock. Hence I can't see how a split-in-progress can be preempted
in the first place to cause a priority inversion livelock like
this...
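FWIW, to make sure we are talking about the same failure mode, here
is a stand-alone userspace toy I knocked up (C11 atomics + pthreads,
nothing to do with the real mm code, every name in it is made up for
illustration). One thread "freezes" a refcount to zero, the other
spins retrying a try-get that fails while the count is frozen; on a
single CPU the spinner only makes progress for the freezer if it
yields somewhere in the retry loop:

	/*
	 * Toy model only -- NOT kernel code.  freezer() plays the role of
	 * a folio_ref_freeze()-style split, lookup() plays the role of a
	 * lookup spinning on a failed folio_try_get_rcu()-style try-get.
	 */
	#include <pthread.h>
	#include <sched.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int refcount = 1;		/* the cache's reference */

	static int try_get(void)
	{
		int old = atomic_load(&refcount);

		/* fail if the count is frozen at zero */
		return old != 0 &&
		       atomic_compare_exchange_strong(&refcount, &old, old + 1);
	}

	static void *freezer(void *arg)
	{
		int expected = 1;

		(void)arg;
		/* freeze 1 -> 0: only succeeds if nobody else holds a ref */
		while (!atomic_compare_exchange_strong(&refcount, &expected, 0))
			expected = 1;
		/* ...split work would happen here with the count frozen... */
		atomic_store(&refcount, 1);	/* unfreeze */
		return NULL;
	}

	static void *lookup(void *arg)
	{
		(void)arg;
		while (!try_get())
			sched_yield();		/* the cond_resched() analogue */
		atomic_fetch_sub(&refcount, 1);	/* drop the ref we just took */
		return NULL;
	}

	int main(void)
	{
		pthread_t f, l;

		pthread_create(&f, NULL, freezer, NULL);
		pthread_create(&l, NULL, lookup, NULL);
		pthread_join(f, NULL);
		pthread_join(l, NULL);
		printf("final refcount %d\n", atomic_load(&refcount));
		return 0;
	}

Take the sched_yield() out and pin both threads to one CPU at
real-time priority and the lookup side can starve the freezer
indefinitely. That's the shape of livelock I'd expect a yield in the
kernel retry loop to address, rather than a lock/unlock cycle on the
mapping tree lock.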
> ie this:
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 08341616ae7a..ca0eed80580f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
>  	if (!folio || xa_is_value(folio))
>  		goto out;
>
> -	if (!folio_try_get_rcu(folio))
> +	if (!folio_try_get_rcu(folio)) {
> +		unsigned long flags;
> +
> +		xas_lock_irqsave(&xas, flags);
> +		xas_unlock_irqrestore(&xas, flags);
>  		goto repeat;
> +	}

I would have thought:

	if (!folio_try_get_rcu(folio)) {
		rcu_read_unlock();
		cond_resched();
		rcu_read_lock();
		goto repeat;
	}

would be the right way to yield the CPU to avoid priority inversion
related livelocks here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com