From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, peterx@redhat.com, Marty Mcfadden, Andrea Arcangeli,
    Linus Torvalds, Jann Horn, Christoph Hellwig, Oleg Nesterov,
    Kirill Shutemov, Jan Kara
Subject: [PATCH v2] mm/gup: Allow real explicit breaking of COW
Date: Mon, 10 Aug 2020 10:57:01 -0400
Message-Id: <20200810145701.129228-1-peterx@redhat.com>

Starting from commit 17839856fd58 ("gup: document and work around "COW
can break either way" issue", 2020-06-02), explicit copy-on-write
behavior is enforced for private gup pages even if the request is
read-only.  This is achieved by always passing FOLL_WRITE to emulate a
write.

That fixes the COW issue we were facing.  However, the above commit can
also break userfaultfd-wp and applications like umapsort [1,2].

A umap-like program typically works as follows: a userspace library
manages page allocations and evicts the least recently used pages from
memory to external storage (e.g., a file system).  These are the general
steps to evict an in-memory page in the uffd service thread when the
page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so
      that further writes to page P will block (keeping page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Step (1) makes sure that the page being dumped is always up to date, so
that the page snapshot in the file system is consistent with the one
that was in memory.  However, with commit 17839856fd58, step (2) can
hang.  For example, if we use write() on a file system fd to dump the
page data, the file system driver issues a read gup request to read the
page content, and that read gup is translated to a write gup by the new
enforced COW behavior.  The write gup then triggers handle_userfault()
and hangs the uffd service thread itself.

I think the problem would also go away if the userspace library replaced
the write() to the file system with a memory write to an mmaped region,
because normal page faults do not enforce COW; only gup is affected.
However, we cannot forbid users from using write() or any other form of
kernel-level read gup.

One solution is already mentioned in commit 17839856fd58: provide
explicit BREAK_COW semantics for enforced COW.  Then we can still use
FAULT_FLAG_WRITE to identify whether this is a "real write request" or
an "enforced COW (read) request".

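For illustration only, eviction steps (1)-(3) could be sketched in the
uffd service thread roughly as below.  evict_page() and its parameters
are hypothetical names, not taken from umap; the sketch assumes the
range was registered on uffd with UFFDIO_REGISTER_MODE_WP and that
backing_fd is a writable file.  pwrite() stands in for the write() call
discussed above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from umap): evict one page following steps
 * (1)-(3).  Returns 0 on success, -1 on the first failing step.
 */
int evict_page(int uffd, int backing_fd, void *page, size_t page_size,
	       off_t file_off)
{
	struct uffdio_writeprotect wp = {
		.range = {
			.start = (unsigned long)page,
			.len = page_size,
		},
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	assert(page_size != 0);

	/* (1) Write-protect P so that later writers block in the kernel. */
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
		return -1;

	/*
	 * (2) Copy P to external storage.  The kernel reads P on our
	 * behalf here; this is the step that may go through a read gup.
	 */
	if (pwrite(backing_fd, page, page_size, file_off) != (ssize_t)page_size)
		return -1;

	/* (3) Drop P from memory. */
	if (madvise(page, page_size, MADV_DONTNEED) == -1)
		return -1;

	return 0;
}
```

If the read gup in step (2) is turned into a write fault on the
write-protected page, the fault is reported to the same thread that is
blocked in step (2), which is the hang described above.
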
[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden
CC: Andrea Arcangeli
CC: Linus Torvalds
CC: Andrew Morton
CC: Jann Horn
CC: Christoph Hellwig
CC: Oleg Nesterov
CC: Kirill Shutemov
CC: Jan Kara
Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f
Signed-off-by: Peter Xu
---
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW [Christoph]
- removed comments above do_wp_page which seem redundant
---
 include/linux/mm.h | 3 +++
 mm/gup.c           | 6 ++++--
 mm/memory.c        | 7 ++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..dacba5c7942f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION			0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW	0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_BREAK_COW)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
@@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (is_vm_hugetlb_page(vma)) {
 				if (should_force_cow_break(vma, foll_flags))
-					foll_flags |= FOLL_WRITE;
+					foll_flags |= FOLL_BREAK_COW;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						foll_flags, locked);
@@ -1095,7 +1097,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..7659b0e27a98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,7 +2900,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3291,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4241,7 +4242,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
-- 
2.26.2
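
To make the intended flag flow easier to follow, here is a small
userspace model of the three checks this patch touches.  It is only a
sketch: the model_*() functions are hypothetical names and none of this
is kernel code.  The two new flag values are copied from the patch; the
FOLL_WRITE and FAULT_FLAG_WRITE values are what I believe mm.h uses, and
the exact numbers do not matter for the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of the flag values; see the note above. */
#define FOLL_WRITE		0x01
#define FOLL_BREAK_COW		0x100000
#define FAULT_FLAG_WRITE	0x01
#define FAULT_FLAG_BREAK_COW	0x400

/* Models the gup-flag to fault-flag translation in faultin_page(). */
unsigned int model_fault_flags(unsigned int foll_flags)
{
	unsigned int fault_flags = 0;

	if (foll_flags & FOLL_WRITE)
		fault_flags |= FAULT_FLAG_WRITE;
	if (foll_flags & FOLL_BREAK_COW)
		fault_flags |= FAULT_FLAG_BREAK_COW;
	return fault_flags;
}

/* Models the handle_pte_fault() check that leads into do_wp_page(). */
bool model_calls_do_wp_page(unsigned int fault_flags, bool pte_writable)
{
	return (fault_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) &&
	       !pte_writable;
}

/* Models the check at the top of do_wp_page(). */
bool model_calls_handle_userfault(unsigned int fault_flags, bool pte_uffd_wp)
{
	return (fault_flags & FAULT_FLAG_WRITE) && pte_uffd_wp;
}

/* Walks through the read-gup case from the commit message. */
void model_self_check(void)
{
	/* Enforced-COW read gup on a uffd-wp protected, read-only pte. */
	unsigned int flags = model_fault_flags(FOLL_BREAK_COW);

	assert(model_calls_do_wp_page(flags, false));
	assert(!model_calls_handle_userfault(flags, true));

	/* A real write gup still reaches handle_userfault(). */
	flags = model_fault_flags(FOLL_WRITE);
	assert(model_calls_handle_userfault(flags, true));
}
```

In this model an enforced-COW read still takes the do_wp_page() path but
no longer reaches handle_userfault(), while a real write behaves as
before.
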