From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A3B3CC433F5 for ; Wed, 9 Mar 2022 18:42:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EAB6C8D0002; Wed, 9 Mar 2022 13:42:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E5B808D0001; Wed, 9 Mar 2022 13:42:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D232F8D0002; Wed, 9 Mar 2022 13:42:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0211.hostedemail.com [216.40.44.211]) by kanga.kvack.org (Postfix) with ESMTP id C3D3C8D0001 for ; Wed, 9 Mar 2022 13:42:51 -0500 (EST) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 71D35181CB161 for ; Wed, 9 Mar 2022 18:42:51 +0000 (UTC) X-FDA: 79225719342.22.0349802 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf03.hostedemail.com (Postfix) with ESMTP id DB7782001A for ; Wed, 9 Mar 2022 18:42:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1646851370; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EiYo1sQ+iyTxNxBo5i2JTVVNprtfKtFAyovPZOOp01o=; b=NJDXLwQASXGcZBoMsgf1tpLXkY3h+QgACknHZFDRj7ZkMJdPj4919x+IaRmmrpRHkVBq0I d2EDwXG2Oio7jxiuQYDfqmpsO/JQBouwSYd6YH1KDSKqkC2HxPg5Bd5IDzDz7dyaUj+fig HScS96SBrvVNMEoxeUyA3Gm7UYtQH+A= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-622-R6LYSyqyP3ineM-H9bcLww-1; Wed, 09 Mar 2022 13:42:47 -0500 X-MC-Unique: R6LYSyqyP3ineM-H9bcLww-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id E72C02F27; Wed, 9 Mar 2022 18:42:45 +0000 (UTC) Received: from max.com (unknown [10.40.192.17]) by smtp.corp.redhat.com (Postfix) with ESMTP id 524F28EEB7; Wed, 9 Mar 2022 18:42:40 +0000 (UTC) From: Andreas Gruenbacher To: Linus Torvalds Cc: Andreas Gruenbacher , Catalin Marinas , David Hildenbrand , Alexander Viro , linux-s390 , Linux-MM , linux-fsdevel , linux-btrfs Subject: Re: Buffered I/O broken on s390x with page faults disabled (gfs2) Date: Wed, 9 Mar 2022 19:42:38 +0100 Message-Id: <20220309184238.1583093-1-agruenba@redhat.com> In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: DB7782001A X-Stat-Signature: kx8bwjn1eowoqtgdahgw9xhm3fzdxkdp Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=NJDXLwQA; spf=none (imf03.hostedemail.com: domain of agruenba@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=agruenba@redhat.com; dmarc=pass (policy=none) header.from=redhat.com X-Rspam-User: X-HE-Tag: 1646851370-876089 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Mar 9, 2022 at 1:22 AM Linus Torvalds wrote:=0D > On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher = wrote:=0D > >=0D > > Seems to be working on s390x for this test case at least; the kind of=0D > > trace I'm getting is:=0D >=0D > Good.=0D >=0D > > This shows bursts of successful fault-ins in gfs2_file_read_iter=0D > > (read_fault). The pauses in between might be caused by the storage;=0D > > I'm not sure.=0D >=0D > Don't know about the pauses, but the burst size might be impacted by that= =0D >=0D > + =C2=A0 =C2=A0 =C2=A0 const size_t max_size =3D 4 * PAGE_SIZE;=0D >=0D > thing that limits how many calls to fixup_user_fault() we do per=0D > fault_in_safe_writeable().=0D >=0D > So it might be worth checking if that value seems to make any difference.= =0D >=0D > > I'd still let the caller of fault_in_safe_writeable() decide how much=0D > > memory to fault in; the tight cap in fault_in_safe_writeable() doesn't= =0D > > seem useful.=0D >=0D > Well, there are real latency concerns there - fixup_user_fault() is=0D > not necessarily all that low-cost.=0D =0D I just don't know if making the GUP based approach work instead of=0D switching to fixup_user_fault(), or introducing something else, would=0D make enough of a performance difference to be worth it.=0D =0D > And it's actually going to be worse when we have the sub-page coloring=0D > issues happening, and will need to probe at a 128-byte granularity=0D > (not on s390, but on arm64).=0D >=0D > At that point, we almost certainly will need to have a "probe user=0D > space non-destructibly for writability" instruction (possibly=0D > extending on our current arch_futex_atomic_op_inuser()=0D > infrastructure).=0D =0D Let me add Catalin Marinas to the CC.=0D =0D >From what I took from the previous discussion, probing at a sub-page=0D granularity won't be necessary for bytewise copying: when the address=0D we're trying to access is poisoned, fault_in_*() will fail; when we get=0D a short result, that will take us to the poisoned address in the next=0D iteration.=0D =0D The problematic case was copying objects that cross fault domains, when=0D we're getting an all-or-nothing result back from the copying and the=0D address we're trying to fault_in_*() isn't the address of the actual=0D fault. The fix for those cases is to pass back the address of the=0D actual fault in one way or another.=0D =0D > So honestly, if you do IO on areas that will get page faults on them,=0D > to some degree it's a "you are doing something stupid, you get what=0D > you deserve". This code should _work_, it should not have to worry=0D > about users having bad patterns (where "read data into huge cold=0D > mappings under enough memory pressure that it causes access bit faults=0D > in the middle of the read" very much counts as such a bad pattern).=0D =0D With a large enough buffer, a simple malloc() will return unmapped=0D pages, and reading into such a buffer will result in fault-in. So page=0D faults during read() are actually pretty normal, and it's not the user's=0D fault.=0D =0D In my test case, the buffer was pre-initialized with memset() to avoid=0D those kinds of page faults, which meant that the page faults in=0D gfs2_file_read_iter() only started to happen when we were out of memory.=0D But that's not the common case.=0D =0D > > Also, you want to put in an extra L here:=0D > > > Signed-off-by: linus Torvalds =0D >=0D > Heh. Fixed locally.=0D =0D Attached is a revised patch; only lightly tested so far. Changes:=0D =0D * Fix the function description.=0D =0D * No need for untagged_addr() as that is handled in fixup_user_fault().=0D =0D * Get rid of max_size: it really makes no sense to second-guess what the=0D caller needs. In cases where fault_in_safe_writeable() is used for=0D buffers larger than a handful of pages, it is entirely the caller's=0D responsibility to scale back the fault-in size in anticipation of or=0D in reaction to page-out.=0D =0D * Use the same control flow as in fault_in_readable(); there is no=0D need for anything more complicated anymore.=0D =0D * The same patch description still applies.=0D =0D Thanks,=0D Andreas=0D =0D diff --git a/mm/gup.c b/mm/gup.c=0D index f1bf3a1f6d109..5e777049bdf41 100644=0D --- a/mm/gup.c=0D +++ b/mm/gup.c=0D @@ -1841,9 +1841,9 @@ EXPORT_SYMBOL(fault_in_writeable);=0D * @uaddr: start of address range=0D * @size: length of address range=0D *=0D - * Faults in an address range using get_user_pages, i.e., without triggeri= ng=0D - * hardware page faults. This is primarily useful when we already know th= at=0D - * some or all of the pages in the address range aren't in memory.=0D + * Faults in an address range for writing. This is primarily useful when = we=0D + * already know that some or all of the pages in the address range aren't = in=0D + * memory.=0D *=0D * Other than fault_in_writeable(), this function is non-destructive.=0D *=0D @@ -1856,46 +1856,33 @@ EXPORT_SYMBOL(fault_in_writeable);=0D */=0D size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)=0D {=0D - unsigned long start =3D (unsigned long)untagged_addr(uaddr);=0D - unsigned long end, nstart, nend;=0D + const unsigned int fault_flags =3D FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE= ;=0D + unsigned long start =3D (unsigned long)uaddr, end;=0D struct mm_struct *mm =3D current->mm;=0D - struct vm_area_struct *vma =3D NULL;=0D - int locked =3D 0;=0D =0D - nstart =3D start & PAGE_MASK;=0D + if (unlikely(size =3D=3D 0))=0D + return 0;=0D end =3D PAGE_ALIGN(start + size);=0D - if (end < nstart)=0D + if (end < start)=0D end =3D 0;=0D - for (; nstart !=3D end; nstart =3D nend) {=0D - unsigned long nr_pages;=0D - long ret;=0D =0D - if (!locked) {=0D - locked =3D 1;=0D - mmap_read_lock(mm);=0D - vma =3D find_vma(mm, nstart);=0D - } else if (nstart >=3D vma->vm_end)=0D - vma =3D vma->vm_next;=0D - if (!vma || vma->vm_start >=3D end)=0D - break;=0D - nend =3D end ? min(end, vma->vm_end) : vma->vm_end;=0D - if (vma->vm_flags & (VM_IO | VM_PFNMAP))=0D - continue;=0D - if (nstart < vma->vm_start)=0D - nstart =3D vma->vm_start;=0D - nr_pages =3D (nend - nstart) / PAGE_SIZE;=0D - ret =3D __get_user_pages_locked(mm, nstart, nr_pages,=0D - NULL, NULL, &locked,=0D - FOLL_TOUCH | FOLL_WRITE);=0D - if (ret <=3D 0)=0D - break;=0D - nend =3D nstart + ret * PAGE_SIZE;=0D + mmap_read_lock(mm);=0D + if (!PAGE_ALIGNED(start)) {=0D + if (fixup_user_fault(mm, start, fault_flags, NULL))=0D + return size;=0D + start =3D PAGE_ALIGN(start);=0D }=0D - if (locked)=0D - mmap_read_unlock(mm);=0D - if (nstart =3D=3D end)=0D - return 0;=0D - return size - min_t(size_t, nstart - start, size);=0D + while (start !=3D end) {=0D + if (fixup_user_fault(mm, start, fault_flags, NULL))=0D + goto out;=0D + start +=3D PAGE_SIZE;=0D + }=0D + mmap_read_unlock(mm);=0D +=0D +out:=0D + if (size > (unsigned long)uaddr - start)=0D + return size - ((unsigned long)uaddr - start);=0D + return 0;=0D }=0D EXPORT_SYMBOL(fault_in_safe_writeable);=0D =0D