Date: Tue, 3 Oct 2023 15:18:53 +0200
From: Jan Kara
To: Hugh Dickins
Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
Message-ID: <20231003131853.ramdlfw5s6ne4iqx@quack3>
References: <6fe379a4-6176-9225-9263-fe60d2633c0@google.com>
In-Reply-To: <6fe379a4-6176-9225-9263-fe60d2633c0@google.com>

On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> distraction from the proper business of shmem_fault(): separate it out.
> (This used to help compilers save stack on the fault path too, but both
> gcc and clang nowadays seem to make better choices anyway.)
> 
> Signed-off-by: Hugh Dickins

Looks good. Feel free to add:

Reviewed-by: Jan Kara

Looking at the code I'm just wondering whether the livelock with
shmem_undo_range() couldn't be more easily avoided by making
shmem_undo_range() always advance the index by 1 after evicting a page,
thus guaranteeing forward progress... Because forward progress within
find_get_entries() is guaranteed these days, that should be enough.
(A toy userspace sketch of this forward-progress argument is appended
after the quoted patch.)

								Honza

> ---
>  mm/shmem.c | 126 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 69 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 824eb55671d2..5501a5bc8d8c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
>   * entry unconditionally - even if something else had already woken the
>   * target.
>   */
> -static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
> +static int synchronous_wake_function(wait_queue_entry_t *wait,
> +			unsigned int mode, int sync, void *key)
>  {
>  	int ret = default_wake_function(wait, mode, sync, key);
>  	list_del_init(&wait->entry);
>  	return ret;
>  }
>  
> +/*
> + * Trinity finds that probing a hole which tmpfs is punching can
> + * prevent the hole-punch from ever completing: which in turn
> + * locks writers out with its hold on i_rwsem. So refrain from
> + * faulting pages into the hole while it's being punched. Although
> + * shmem_undo_range() does remove the additions, it may be unable to
> + * keep up, as each new page needs its own unmap_mapping_range() call,
> + * and the i_mmap tree grows ever slower to scan if new vmas are added.
> + *
> + * It does not matter if we sometimes reach this check just before the
> + * hole-punch begins, so that one fault then races with the punch:
> + * we just need to make racing faults a rare case.
> + *
> + * The implementation below would be much simpler if we just used a
> + * standard mutex or completion: but we cannot take i_rwsem in fault,
> + * and bloating every shmem inode for this unlikely case would be sad.
> + */
> +static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
> +{
> +	struct shmem_falloc *shmem_falloc;
> +	struct file *fpin = NULL;
> +	vm_fault_t ret = 0;
> +
> +	spin_lock(&inode->i_lock);
> +	shmem_falloc = inode->i_private;
> +	if (shmem_falloc &&
> +	    shmem_falloc->waitq &&
> +	    vmf->pgoff >= shmem_falloc->start &&
> +	    vmf->pgoff < shmem_falloc->next) {
> +		wait_queue_head_t *shmem_falloc_waitq;
> +		DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> +
> +		ret = VM_FAULT_NOPAGE;
> +		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> +		shmem_falloc_waitq = shmem_falloc->waitq;
> +		prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +				TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&inode->i_lock);
> +		schedule();
> +
> +		/*
> +		 * shmem_falloc_waitq points into the shmem_fallocate()
> +		 * stack of the hole-punching task: shmem_falloc_waitq
> +		 * is usually invalid by the time we reach here, but
> +		 * finish_wait() does not dereference it in that case;
> +		 * though i_lock needed lest racing with wake_up_all().
> +		 */
> +		spin_lock(&inode->i_lock);
> +		finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +	}
> +	spin_unlock(&inode->i_lock);
> +	if (fpin) {
> +		fput(fpin);
> +		ret = VM_FAULT_RETRY;
> +	}
> +	return ret;
> +}
> +
>  static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  {
> -	struct vm_area_struct *vma = vmf->vma;
> -	struct inode *inode = file_inode(vma->vm_file);
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>  	struct folio *folio = NULL;
> +	vm_fault_t ret = 0;
>  	int err;
> -	vm_fault_t ret = VM_FAULT_LOCKED;
>  
>  	/*
>  	 * Trinity finds that probing a hole which tmpfs is punching can
> -	 * prevent the hole-punch from ever completing: which in turn
> -	 * locks writers out with its hold on i_rwsem. So refrain from
> -	 * faulting pages into the hole while it's being punched. Although
> -	 * shmem_undo_range() does remove the additions, it may be unable to
> -	 * keep up, as each new page needs its own unmap_mapping_range() call,
> -	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> -	 *
> -	 * It does not matter if we sometimes reach this check just before the
> -	 * hole-punch begins, so that one fault then races with the punch:
> -	 * we just need to make racing faults a rare case.
> -	 *
> -	 * The implementation below would be much simpler if we just used a
> -	 * standard mutex or completion: but we cannot take i_rwsem in fault,
> -	 * and bloating every shmem inode for this unlikely case would be sad.
> +	 * prevent the hole-punch from ever completing: noted in i_private.
>  	 */
>  	if (unlikely(inode->i_private)) {
> -		struct shmem_falloc *shmem_falloc;
> -
> -		spin_lock(&inode->i_lock);
> -		shmem_falloc = inode->i_private;
> -		if (shmem_falloc &&
> -		    shmem_falloc->waitq &&
> -		    vmf->pgoff >= shmem_falloc->start &&
> -		    vmf->pgoff < shmem_falloc->next) {
> -			struct file *fpin;
> -			wait_queue_head_t *shmem_falloc_waitq;
> -			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> -
> -			ret = VM_FAULT_NOPAGE;
> -			fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> -			if (fpin)
> -				ret = VM_FAULT_RETRY;
> -
> -			shmem_falloc_waitq = shmem_falloc->waitq;
> -			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> -					TASK_UNINTERRUPTIBLE);
> -			spin_unlock(&inode->i_lock);
> -			schedule();
> -
> -			/*
> -			 * shmem_falloc_waitq points into the shmem_fallocate()
> -			 * stack of the hole-punching task: shmem_falloc_waitq
> -			 * is usually invalid by the time we reach here, but
> -			 * finish_wait() does not dereference it in that case;
> -			 * though i_lock needed lest racing with wake_up_all().
> -			 */
> -			spin_lock(&inode->i_lock);
> -			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> -			spin_unlock(&inode->i_lock);
> -
> -			if (fpin)
> -				fput(fpin);
> +		ret = shmem_falloc_wait(vmf, inode);
> +		if (ret)
>  			return ret;
> -		}
> -		spin_unlock(&inode->i_lock);
>  	}
>  
> +	WARN_ON_ONCE(vmf->page != NULL);
>  	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
>  				  gfp, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	if (folio)
> +	if (folio) {
>  		vmf->page = folio_file_page(folio, vmf->pgoff);
> +		ret |= VM_FAULT_LOCKED;
> +	}
>  	return ret;
>  }
>  
> -- 
> 2.35.3
> 

-- 
Jan Kara
SUSE Labs, CR
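A minimal userspace sketch of the forward-progress argument above, for
illustration only: toy_undo_range(), toy_fault() and page_present[] are
invented names, not kernel APIs, and this is not the mm/shmem.c code. It
only models how a scan that advances its index by one after each eviction
finishes a pass over [start, end) in end - start steps, no matter how often
a racing fault re-populates pages the scan has already passed.

/* toy_undo_range.c: userspace model of the forward-progress idea above. */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 16UL

static bool page_present[NR_PAGES];

/* A racing "fault" re-instantiates a page, possibly one already evicted. */
static void toy_fault(unsigned long index)
{
	page_present[index] = true;
}

/*
 * Evict every present page in [start, end).  The index only ever moves
 * forward, so even though toy_fault() keeps re-populating index `start`
 * behind us, the loop terminates after exactly end - start iterations.
 */
static unsigned long toy_undo_range(unsigned long start, unsigned long end)
{
	unsigned long index, evicted = 0;

	for (index = start; index < end; index++) {
		if (page_present[index]) {
			page_present[index] = false;
			evicted++;
		}
		/* Simulate a fault racing with the hole-punch. */
		toy_fault(start);
	}
	return evicted;
}

int main(void)
{
	unsigned long i;

	for (i = 0; i < NR_PAGES; i++)
		page_present[i] = true;

	printf("evicted %lu of %lu pages despite racing faults\n",
	       toy_undo_range(0, NR_PAGES), NR_PAGES);
	return 0;
}

Compiled with "cc toy_undo_range.c" it prints "evicted 16 of 16 pages
despite racing faults": the single pass is bounded even under re-faults,
which is the property the suggested index-advance in shmem_undo_range()
would rely on.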