From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: peterx@redhat.com, Andrew Morton, Marty Mcfadden, Andrea Arcangeli,
    Linus Torvalds, Jann Horn, Christoph Hellwig, Oleg Nesterov,
    Kirill Shutemov, Jan Kara
Subject: [PATCH] mm/gup: Allow real explicit breaking of COW
Date: Sat, 8 Aug 2020 18:38:02 -0400
Message-Id: <20200808223802.11451-1-peterx@redhat.com>

Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even for read-only access.  This is achieved
by always passing FOLL_WRITE to emulate a write.

That should fix the COW issue that we were facing.  However, the above
commit could also break userfaultfd-wp and applications like umapsort [1,2].

One general routine of a umap-like program is: the userspace library manages
page allocations, and it evicts the least recently used pages from memory to
external storage (e.g., file systems).  Below are the general steps to evict
an in-memory page in the uffd service thread when the page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
      further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Here step (1) makes sure that the page to dump will always be up-to-date, so
that the page snapshot in the file system is consistent with the one that was
in memory.  However, with commit 17839856fd58, step (2) can potentially hang:
if we use write() on a file system fd to dump the page data, the file system
driver translates that into a read gup request to read the page content, and
the read gup is then translated to a write gup due to the newly enforced COW
behavior.  This write gup will further trigger handle_userfault() and hang
the uffd service thread itself.

I think the problem would also go away if we replaced the write() to the file
system with a memory write to an mmaped region in the userspace library,
because normal page faults do not enforce COW; only gup is affected.  However,
we cannot forbid users from using write() or any other form of kernel-level
read gup.

One solution is actually already mentioned in commit 17839856fd58, which is
to provide explicit BREAK_COW semantics for enforced COW.  Then we can still
use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an
"enforced COW (read) request".
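
For illustration only (this is not part of the patch), the intended decision
for a present PTE can be modeled in plain userspace C.  The names model_pte,
fault_action, model_do_wp_page() and model_present_pte_fault() are
hypothetical.  FAULT_FLAG_WRITE is assumed to be 0x01 as in mainline, and
FAULT_FLAG_BREAK_COW uses the 0x400 value proposed in this patch.  The model
ignores page reuse, locking, retries and the swap-in path, so treat it as a
rough sketch of the flag handling rather than of the real fault handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: WRITE as in mainline, BREAK_COW as proposed by this patch. */
#define FAULT_FLAG_WRITE	0x01u
#define FAULT_FLAG_BREAK_COW	0x400u

/* Hypothetical, heavily reduced view of a present PTE. */
struct model_pte {
	bool writable;	/* pte_write() would return true */
	bool uffd_wp;	/* userfaultfd_pte_wp() would return true */
};

/* Hypothetical outcome labels for the model. */
enum fault_action {
	ACTION_NONE,		/* plain read fault, nothing COW related */
	ACTION_MARK_DIRTY,	/* PTE already writable, only marked dirty */
	ACTION_COW,		/* COW handling without notifying userfaultfd */
	ACTION_USERFAULT,	/* handle_userfault(vmf, VM_UFFD_WP) */
};

/* Models the do_wp_page() entry check: only real writes reach uffd-wp. */
static enum fault_action model_do_wp_page(unsigned int flags,
					  struct model_pte pte)
{
	if ((flags & FAULT_FLAG_WRITE) && pte.uffd_wp)
		return ACTION_USERFAULT;
	return ACTION_COW;
}

/* Models the present-PTE branch of handle_pte_fault(). */
static enum fault_action model_present_pte_fault(unsigned int flags,
						 struct model_pte pte)
{
	if (flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
		if (!pte.writable)
			return model_do_wp_page(flags, pte);
		return ACTION_MARK_DIRTY;
	}
	return ACTION_NONE;
}

/* Walks through the cases discussed in the commit message. */
static void model_self_check(void)
{
	struct model_pte wp_page = { .writable = false, .uffd_wp = true };
	struct model_pte ro_page = { .writable = false, .uffd_wp = false };

	/* Enforced COW on a read gup must not block on uffd-wp. */
	assert(model_present_pte_fault(FAULT_FLAG_BREAK_COW, wp_page)
	       == ACTION_COW);
	/* A real write to a uffd-wp page still notifies userspace. */
	assert(model_present_pte_fault(FAULT_FLAG_WRITE, wp_page)
	       == ACTION_USERFAULT);
	/* A plain read fault is unaffected. */
	assert(model_present_pte_fault(0, ro_page) == ACTION_NONE);
}
```

In this model the eviction case from step (2) corresponds to the first
assertion in model_self_check(): the read gup carries only
FAULT_FLAG_BREAK_COW, so the uffd service thread is not sent back into
handle_userfault().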

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden
CC: Andrea Arcangeli
CC: Linus Torvalds
CC: Andrew Morton
CC: Jann Horn
CC: Christoph Hellwig
CC: Oleg Nesterov
CC: Kirill Shutemov
CC: Jan Kara
Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f
Signed-off-by: Peter Xu
---
 include/linux/mm.h |  3 +++
 mm/gup.c           |  4 ++--
 mm/memory.c        | 15 ++++++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..dacba5c7942f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION			0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW	0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..02267f5797a7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1076,7 +1076,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (is_vm_hugetlb_page(vma)) {
 				if (should_force_cow_break(vma, foll_flags))
-					foll_flags |= FOLL_WRITE;
+					foll_flags |= FOLL_BREAK_COW;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						foll_flags, locked);
@@ -1095,7 +1095,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..0c819056374e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,7 +2900,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3291,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	/*
+	 * We'll do a COW if it's a write or the caller wants explicit COW
+	 * behavior (even if it's a read operation)
+	 */
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4241,7 +4246,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	/*
+	 * We'll do a COW if it's a write or the caller wants explicit COW
+	 * behavior (even if it's a read operation)
+	 */
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
-- 
2.26.2