From: Jann Horn <jannh@google.com>
Date: Sat, 16 Sep 2023 01:33:32 +0200
Subject: Re: [PATCH 2/3] userfaultfd: UFFDIO_REMAP uABI
To: Suren Baghdasaryan <surenb@google.com>
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com, lokeshgidra@google.com, peterx@redhat.com, david@redhat.com, hughd@google.com, mhocko@suse.com, axelrasmussen@google.com, rppt@kernel.org, willy@infradead.org, Liam.Howlett@oracle.com, zhangpeng362@huawei.com, bgeffon@google.com, kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, kernel-team@android.com
References: <20230914152620.2743033-1-surenb@google.com> <20230914152620.2743033-3-surenb@google.com>
In-Reply-To: <20230914152620.2743033-3-surenb@google.com>

On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> +/*
> + * The mmap_lock for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_pages_pte(struct mm_struct *dst_mm,
> +                           struct mm_struct *src_mm,
> +                           pmd_t *dst_pmd,
> +                           pmd_t *src_pmd,
> +                           struct vm_area_struct *dst_vma,
> +                           struct vm_area_struct *src_vma,
> +                           unsigned long dst_addr,
> +                           unsigned long src_addr,
> +                           __u64 mode)
> +{
> +        swp_entry_t entry;
> +        pte_t orig_src_pte, orig_dst_pte;
> +        spinlock_t *src_ptl, *dst_ptl;
> +        pte_t *src_pte = NULL;
> +        pte_t *dst_pte = NULL;
> +
> +        struct folio *src_folio = NULL;
> +        struct anon_vma *src_anon_vma = NULL;
> +        struct anon_vma *dst_anon_vma;
> +        struct mmu_notifier_range range;
> +        int err = 0;
> +
> +retry:
> +        dst_pte = pte_offset_map_nolock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
> +
> +        /* If an huge pmd materialized from under us fail */
> +        if (unlikely(!dst_pte)) {
> +                err = -EFAULT;
> +                goto out;
> +        }
> +
> +        src_pte = pte_offset_map_nolock(src_mm, src_pmd, src_addr, &src_ptl);
> +
> +        /*
> +         * We held the mmap_lock for reading so MADV_DONTNEED
> +         * can zap transparent huge pages under us, or the
> +         * transparent huge page fault can establish new
> +         * transparent huge pages under us.
> +         */
> +        if (unlikely(!src_pte)) {
> +                err = -EFAULT;
> +                goto out;
> +        }
> +
> +        BUG_ON(pmd_none(*dst_pmd));
> +        BUG_ON(pmd_none(*src_pmd));
> +        BUG_ON(pmd_trans_huge(*dst_pmd));
> +        BUG_ON(pmd_trans_huge(*src_pmd));
> +
> +        spin_lock(dst_ptl);
> +        orig_dst_pte = *dst_pte;
> +        spin_unlock(dst_ptl);
> +        if (!pte_none(orig_dst_pte)) {
> +                err = -EEXIST;
> +                goto out;
> +        }
> +
> +        spin_lock(src_ptl);
> +        orig_src_pte = *src_pte;
> +        spin_unlock(src_ptl);
> +        if (pte_none(orig_src_pte)) {
> +                if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
> +                        err = -ENOENT;
> +                else /* nothing to do to remap a hole */
> +                        err = 0;
> +                goto out;
> +        }
> +
> +        if (pte_present(orig_src_pte)) {
> +                if (!src_folio) {
> +                        struct folio *folio;
> +
> +                        /*
> +                         * Pin the page while holding the lock to be sure the
> +                         * page isn't freed under us
> +                         */
> +                        spin_lock(src_ptl);
> +                        if (!pte_same(orig_src_pte, *src_pte)) {
> +                                spin_unlock(src_ptl);
> +                                err = -EAGAIN;
> +                                goto out;
> +                        }
> +
> +                        folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> +                        if (!folio || !folio_test_anon(folio) ||
> +                            folio_estimated_sharers(folio) != 1) {
> +                                spin_unlock(src_ptl);
> +                                err = -EBUSY;
> +                                goto out;
> +                        }
> +
> +                        src_folio = folio;
> +                        folio_get(src_folio);
> +                        spin_unlock(src_ptl);
> +
> +                        /* try to block all concurrent rmap walks */
> +                        if (!folio_trylock(src_folio)) {
> +                                pte_unmap(&orig_src_pte);
> +                                pte_unmap(&orig_dst_pte);
> +                                src_pte = dst_pte = NULL;
> +                                folio_lock(src_folio);
> +                                goto retry;
> +                        }
> +                }
> +
> +                if (!src_anon_vma) {
> +                        /*
> +                         * folio_referenced walks the anon_vma chain
> +                         * without the folio lock. Serialize against it with
> +                         * the anon_vma lock, the folio lock is not enough.
> +                         */
> +                        src_anon_vma = folio_get_anon_vma(src_folio);
> +                        if (!src_anon_vma) {
> +                                /* page was unmapped from under us */
> +                                err = -EAGAIN;
> +                                goto out;
> +                        }
> +                        if (!anon_vma_trylock_write(src_anon_vma)) {
> +                                pte_unmap(&orig_src_pte);
> +                                pte_unmap(&orig_dst_pte);
> +                                src_pte = dst_pte = NULL;
> +                                anon_vma_lock_write(src_anon_vma);
> +                                goto retry;
> +                        }
> +                }
> +
> +                mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm,
> +                                        src_addr, src_addr + PAGE_SIZE);
> +                mmu_notifier_invalidate_range_start_nonblock(&range);

I'm pretty sure you're not allowed to use
mmu_notifier_invalidate_range_start_nonblock(). Use
mmu_notifier_invalidate_range_start() at a point where it's fine to
block; perhaps all the way up in remap_pages() around the whole loop
or something like that.

mmu_notifier_invalidate_range_start_nonblock() is special magic sauce
for OOM reaping (with deceptively inviting naming and no documentation
that explains that this is evil hazmat code), and if you chase down
what various users of MMU notifiers do when you just hand them a
random notification where mmu_notifier_range_blockable() is false, you
end up in fun paths like in KVM's
kvm_mmu_notifier_invalidate_range_start(), which calls
gfn_to_pfn_cache_invalidate_start(), which does this:

                /*
                 * If the OOM reaper is active, then all vCPUs should have
                 * been stopped already, so perform the request without
                 * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
                 */
                if (!may_block)
                        req &= ~KVM_REQUEST_WAIT;

                called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

                WARN_ON_ONCE(called && !may_block);

Which means if you hit this codepath from a place like userfaultfd
where the process is probably still running (and not being OOM reaped
in a context where it's guaranteed to have been killed and stopped),
you could probably get KVM into a situation where it needs to wait for
vCPUs to acknowledge that they've processed a message before it's safe
to continue, but it can't wait because we're in nonsleepable context,
and then it just continues without waiting and presumably the KVM TLB
gets out of sync or something?

> +
> +                double_pt_lock(dst_ptl, src_ptl);
> +
> +                if (!pte_same(*src_pte, orig_src_pte) ||
> +                    !pte_same(*dst_pte, orig_dst_pte) ||
> +                    folio_estimated_sharers(src_folio) != 1) {
> +                        double_pt_unlock(dst_ptl, src_ptl);
> +                        err = -EAGAIN;
> +                        goto out;
> +                }
> +
> +                BUG_ON(!folio_test_anon(src_folio));
> +                /* the PT lock is enough to keep the page pinned now */
> +                folio_put(src_folio);
> +
> +                dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +                WRITE_ONCE(src_folio->mapping,
> +                           (struct address_space *) dst_anon_vma);
> +                WRITE_ONCE(src_folio->index, linear_page_index(dst_vma,
> +                                                               dst_addr));
> +
> +                if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> +                              orig_src_pte))
> +                        BUG_ON(1);
> +
> +                orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> +                orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> +                                             dst_vma);
> +
> +                set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
> +
> +                if (dst_mm != src_mm) {
> +                        inc_mm_counter(dst_mm, MM_ANONPAGES);
> +                        dec_mm_counter(src_mm, MM_ANONPAGES);
> +                }
> +
> +                double_pt_unlock(dst_ptl, src_ptl);
> +
> +                anon_vma_unlock_write(src_anon_vma);
> +                mmu_notifier_invalidate_range_end(&range);
> +                put_anon_vma(src_anon_vma);
> +                src_anon_vma = NULL;
> +
> +                /* unblock rmap walks */
> +                folio_unlock(src_folio);
> +                src_folio = NULL;
> +
> +        } else {
> +                struct swap_info_struct *si;
> +                int swap_count;
> +
> +                entry = pte_to_swp_entry(orig_src_pte);
> +                if (non_swap_entry(entry)) {
> +                        if (is_migration_entry(entry)) {
> +                                pte_unmap(&orig_src_pte);
> +                                pte_unmap(&orig_dst_pte);
> +                                src_pte = dst_pte = NULL;
> +                                migration_entry_wait(src_mm, src_pmd,
> +                                                     src_addr);
> +                                err = -EAGAIN;
> +                        } else
> +                                err = -EFAULT;
> +                        goto out;
> +                }
> +
> +                /*
> +                 * COUNT_CONTINUE to be returned is fine here, no need
> +                 * of follow all swap continuation to check against
> +                 * number 1.
> +                 */
> +                si = get_swap_device(entry);
> +                if (!si) {
> +                        err = -EBUSY;
> +                        goto out;
> +                }
> +
> +                swap_count = swap_swapcount(si, entry);
> +                put_swap_device(si);
> +                if (swap_count != 1) {
> +                        err = -EBUSY;
> +                        goto out;
> +                }
> +
> +                double_pt_lock(dst_ptl, src_ptl);
> +
> +                if (!pte_same(*src_pte, orig_src_pte) ||
> +                    !pte_same(*dst_pte, orig_dst_pte) ||
> +                    swp_swapcount(entry) != 1) {
> +                        double_pt_unlock(dst_ptl, src_ptl);
> +                        err = -EAGAIN;
> +                        goto out;
> +                }
> +
> +                if (pte_val(ptep_get_and_clear(src_mm, src_addr, src_pte)) !=
> +                    pte_val(orig_src_pte))
> +                        BUG_ON(1);
> +                set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
> +
> +                if (dst_mm != src_mm) {
> +                        inc_mm_counter(dst_mm, MM_ANONPAGES);
> +                        dec_mm_counter(src_mm, MM_ANONPAGES);
> +                }
> +
> +                double_pt_unlock(dst_ptl, src_ptl);
> +        }
> +
> +out:
> +        if (src_anon_vma) {
> +                anon_vma_unlock_write(src_anon_vma);
> +                put_anon_vma(src_anon_vma);
> +        }
> +        if (src_folio) {
> +                folio_unlock(src_folio);
> +                folio_put(src_folio);
> +        }
> +        if (dst_pte)
> +                pte_unmap(dst_pte);
> +        if (src_pte)
> +                pte_unmap(src_pte);
> +
> +        return err;
> +}
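For concreteness, the "hoist it up into remap_pages()" suggestion could look roughly like the sketch below. This is only an illustration of the direction, not a tested patch: it assumes remap_pages() walks the range page by page, that remap_pages_pte() would drop its own notifier calls, and the loop body is elided.

```
/* Sketch only: hypothetical shape of remap_pages() with a blockable
 * notification around the whole loop, instead of a per-page
 * _nonblock notification inside remap_pages_pte(). */
ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                    unsigned long dst_start, unsigned long src_start,
                    unsigned long len, __u64 mode)
{
        struct mmu_notifier_range range;
        ssize_t moved = 0;

        /* Sleepable context here, so the blockable variant is fine. */
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm,
                                src_start, src_start + len);
        mmu_notifier_invalidate_range_start(&range);

        for (/* each PAGE_SIZE step in [src_start, src_start + len) */;;) {
                /* remap_pages_pte() no longer issues notifications itself */
                ...
        }

        mmu_notifier_invalidate_range_end(&range);
        return moved;
}
```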