From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9F64C433ED for ; Tue, 11 May 2021 14:01:31 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6547661625 for ; Tue, 11 May 2021 14:01:31 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6547661625 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A3C346B006E; Tue, 11 May 2021 10:01:30 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A121D6B0071; Tue, 11 May 2021 10:01:30 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 88D0A6B0072; Tue, 11 May 2021 10:01:30 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0194.hostedemail.com [216.40.44.194]) by kanga.kvack.org (Postfix) with ESMTP id 6D23C6B006E for ; Tue, 11 May 2021 10:01:30 -0400 (EDT) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 2AC3A180B144F for ; Tue, 11 May 2021 14:01:30 +0000 (UTC) X-FDA: 78129112740.29.B151A5A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf07.hostedemail.com (Postfix) with ESMTP id DF29AA0049CA for ; Tue, 11 May 2021 14:01:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1620741681; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=/J2SbDDpf1/xl12w6sV/ZvOSp9oddaI9kZRn1S4ksFA=; b=MBWemO5ULQJSPpJ/5Q9pjQpNyZ4yELbNsT5g5pWa8aS0IidVJbXFTuqT8WgevS7XlcOCx+ zQtQDobslKAR1x5ZHFCsBUow1bo7t5o5nXt7Uq8orjPsY8k+tKLBZtmT6O7qtF+q5UAZJi bWCVqpQdwPaSh02iVQDtYsELfw3gbF8= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-506-UbNFmkdyNFOv3l7m6nD3FA-1; Tue, 11 May 2021 10:01:19 -0400 X-MC-Unique: UbNFmkdyNFOv3l7m6nD3FA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 067BA18E5641; Tue, 11 May 2021 14:01:17 +0000 (UTC) Received: from max.com (unknown [10.40.195.97]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8040E5D6D1; Tue, 11 May 2021 14:01:15 +0000 (UTC) From: Andreas Gruenbacher To: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jan Kara , Andreas Gruenbacher Subject: [PATCH] [RFC] Trigger retry from fault vm operation Date: Tue, 11 May 2021 16:01:13 +0200 Message-Id: <20210511140113.1225981-1-agruenba@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Rspamd-Queue-Id: DF29AA0049CA Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=MBWemO5U; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf07.hostedemail.com: domain of agruenba@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=agruenba@redhat.com X-Rspamd-Server: rspam03 X-Stat-Signature: sx67y6xzc6jekjwpqjdbdhewbhecf3hn Received-SPF: none (redhat.com>: No applicable sender policy available) receiver=imf07; identity=mailfrom; envelope-from=""; helo=us-smtp-delivery-124.mimecast.com; client-ip=170.10.133.124 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1620741683-797904 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, we have a locking problem in gfs2 that I don't have a proper solution for= , so I'm looking for suggestions. What's happening is that a page fault triggers during a read or write operation, while we're holding a glock (the cluster-wide gfs2 inode lock), and the page fault requires another glock. We can recognize and handle the case when both glocks are the same, but when the page fault re= quires another glock, there is a chance that taking that other glock would deadl= ock. When we realize that we may not be able to take the other glock in gfs2_f= ault, we need to communicate that to the read or write operation, which will th= en drop and re-acquire the "outer" glock and retry. However, there doesn't = seem to be a good way to do that; we can only indicate that a page fault shoul= d fail by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFA= ULT. We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so= that we can tell the retry case apart from genuine -EFAULT errors. To show what I mean, below is a proof of concept that adds a restart_hack= flag to struct task_struct as a side channel. An even uglier alternative woul= d be to abuse task->journal_info. The obvious response to this email would be "fix your locking order". We= ll, we can do that by pinning the user pages in gfs2_file_{read,write}_iter befo= re taking the "outer" glock, but we'd ideally only do that when it's actuall= y necessary (i.e., when gfs2_fault has indicated to retry). Thanks for any thoughts. Andreas --- fs/gfs2/file.c | 37 +++++++++++++++++++++++++++++++++++-- include/linux/sched.h | 1 + 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 658fed79d65a..253e720f2df0 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -543,10 +543,15 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf) vm_fault_t ret; int err; =20 - gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_TRY, &gh); if (likely(!recursive)) { err =3D gfs2_glock_nq(&gh); if (err) { + if (err =3D=3D GLR_TRYFAILED) { + current->restart_hack =3D 1; + ret =3D VM_FAULT_SIGBUS; + goto out_uninit; + } ret =3D block_page_mkwrite_return(err); goto out_uninit; } @@ -773,12 +778,16 @@ static ssize_t gfs2_file_direct_read(struct kiocb *= iocb, struct iov_iter *to, return 0; /* skip atime */ =20 gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); +restart: ret =3D gfs2_glock_nq(gh); if (ret) goto out_uninit; =20 + current->restart_hack =3D 0; ret =3D iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0); gfs2_glock_dq(gh); + if (unlikely(ret =3D=3D -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -803,6 +812,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *i= ocb, struct iov_iter *from, * VFS does. */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); + +restart: ret =3D gfs2_glock_nq(gh); if (ret) goto out_uninit; @@ -811,11 +822,14 @@ static ssize_t gfs2_file_direct_write(struct kiocb = *iocb, struct iov_iter *from, if (offset + len > i_size_read(&ip->i_inode)) goto out; =20 + current->restart_hack =3D 0; ret =3D iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0); if (ret =3D=3D -ENOTBLK) ret =3D 0; out: gfs2_glock_dq(gh); + if (unlikely(ret =3D=3D -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -834,7 +848,9 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb= , struct iov_iter *to) return ret; iocb->ki_flags &=3D ~IOCB_DIRECT; } +restart1: iocb->ki_flags |=3D IOCB_NOIO; + current->restart_hack =3D 0; ret =3D generic_file_read_iter(iocb, to); iocb->ki_flags &=3D ~IOCB_NOIO; if (ret >=3D 0) { @@ -842,6 +858,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb= , struct iov_iter *to) return ret; written =3D ret; } else { + if (unlikely(ret =3D=3D -EFAULT) && current->restart_hack) + goto restart1; if (ret !=3D -EAGAIN) return ret; if (iocb->ki_flags & IOCB_NOWAIT) @@ -849,13 +867,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *io= cb, struct iov_iter *to) } ip =3D GFS2_I(iocb->ki_filp->f_mapping->host); gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + +restart2: ret =3D gfs2_glock_nq(&gh); if (ret) goto out_uninit; + current->restart_hack =3D 0; ret =3D generic_file_read_iter(iocb, to); if (ret > 0) written +=3D ret; gfs2_glock_dq(&gh); + if (unlikely(ret =3D=3D -EFAULT) && current->restart_hack) + goto restart2; out_uninit: gfs2_holder_uninit(&gh); return written ? written : ret; @@ -912,13 +935,18 @@ static ssize_t gfs2_file_write_iter(struct kiocb *i= ocb, struct iov_iter *from) goto out_unlock; =20 iocb->ki_flags |=3D IOCB_DSYNC; +restart1: + current->restart_hack =3D 0; current->backing_dev_info =3D inode_to_bdi(inode); buffered =3D iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info =3D NULL; if (unlikely(buffered <=3D 0)) { + if (unlikely(buffered =3D=3D -EFAULT) && current->restart_hack) + goto restart1; if (!ret) ret =3D buffered; goto out_unlock; + } =20 /* * We need to ensure that the page cache pages are written to @@ -935,10 +963,15 @@ static ssize_t gfs2_file_write_iter(struct kiocb *i= ocb, struct iov_iter *from) if (!ret || ret2 > 0) ret +=3D ret2; } else { +restart2: + current->restart_hack =3D 0; current->backing_dev_info =3D inode_to_bdi(inode); ret =3D iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info =3D NULL; - if (likely(ret > 0)) { + if (unlikely(ret <=3D 0)) { + if (unlikely(ret =3D=3D -EFAULT) && current->restart_hack) + goto restart2; + } else { iocb->ki_pos +=3D ret; ret =3D generic_write_sync(iocb, ret); } diff --git a/include/linux/sched.h b/include/linux/sched.h index d2c881384517..de053ac8d8d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1067,6 +1067,7 @@ struct task_struct { =20 /* Journalling filesystem info: */ void *journal_info; + unsigned restart_hack:1; =20 /* Stacked block device info: */ struct bio_list *bio_list; --=20 2.26.3