Date: Tue, 12 Jul 2022 10:56:34 -0400
From: Peter Xu
To: Nadav Amit
Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen, David Hildenbrand, Mike Rapoport, Nadav Amit
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
References: <20220622185038.71740-1-namit@vmware.com> <20220622185038.71740-3-namit@vmware.com> <5D85870C-CBDF-45F7-A3A5-5F889521BE41@vmware.com>
In-Reply-To: <5D85870C-CBDF-45F7-A3A5-5F889521BE41@vmware.com>

Hi, Nadav,

On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
> On Jun 22, 2022, at 11:50 AM, Nadav Amit wrote:
>
> > From: Nadav Amit
> >
> > Using a PTE on x86 with cleared access-bit (aka young-bit)
> > takes ~600 cycles more than when the access bit is set. At the same
> > time, setting the access-bit for memory that is not used (e.g.,
> > prefetched) can introduce greater overheads, as the prefetched memory is
> > reclaimed later than it should be.
> >
> > Userfaultfd currently does not set the access-bit (excluding the
> > huge-pages case). Arguably, it is best to let the user control whether
> > the access bit should be set or not. The expected use is to request
> > userfaultfd to set the access-bit when the copy/wp operation is done to
> > resolve a page-fault, and not to set the access-bit when the memory is
> > prefetched.
> >
> > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> > young bit to be set.
>
> I reply to my own email, but this mostly addresses the concerns that Peter
> has raised.
>
> So I ran the test below on my Haswell (x86), which showed two things:
>
> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
>    more than with dirty+young (depending on the access, of course: dirty
>    does not matter for read, dirty+young both matter for write).
>
> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
>    created as young with mk_pte(). So the logic should be similar to
>    do_set_pte():
>
>         if (prefault && arch_wants_old_prefaulted_pte())
>                 entry = pte_mkold(entry);
>         else
>                 entry = pte_sw_mkyoung(entry);
>
> Based on these results, I will send another version for both young and
> dirty. Let me know if these results are not convincing.

Thanks for trying to verify this idea, but I'm not sure this addresses the
concern I raised about WRITE_LIKELY.

AFAICT the test below measures the overhead of the hardware setting the
access bit, the dirty bit, or both, when they are not already set on a
read/write.

What I wanted as a justification is whether WRITE_LIKELY would be helpful
in any real-world scenario at all.  AFAIK the only way to prove that so far
is to measure any TLB flush difference (probably only on x86, since that
TLB code is only compiled on x86) that may trigger with W=0,D=1 but may not
trigger with W=0,D=0 (where W stands for "write bit" and D stands for
"dirty bit").

It's not about the slowness when D is cleared.

The core thing is (sorry to repeat myself, I just hope we're on the same
page) that so far we always set the D bit for all uffd pages.  Even if we
want to change that behavior so that we skip setting the D bit for RO pages
(we'd need to convert the dirty bit into PageDirty though), we'll still
always set the D bit for writable pages.  So we set the D bit whenever
possible, and we never suffer from the hardware overhead of setting the D
bit for uffd pages.

The other worry with having WRITE_HINT is that once we have it we probably
need to _not_ apply the dirty bit when WRITE_HINT is not set (which is
actually a very light ABI change, since we used to always set it).  Then
I'd start to worry about the hardware D-bit overhead you just measured,
because existing users that don't specify WRITE_HINT would start paying
that overhead.

So again, I'm totally fine if you want to start with ACCESS_HINT only, but
I still don't see why we should need WRITE_HINT too.

Thanks,

>
> I will add, as we discussed (well, I think I raised these things, so
> hopefully you agree):
>
> 1. On x86, avoid flush if changing WP->RO and PTE is clean.
>
> 2. When write-unprotecting entry, if PTE is exclusive, set it as writable.
>    [ I considered not setting it as writable if write-hint is not provided, but
>    with the change in (1), it does not provide any real value. ]
>
> ---
>
> #define _GNU_SOURCE
> /* Header names reconstructed from what the code below uses. */
> #include <fcntl.h>
> #include <linux/userfaultfd.h>
> #include <pthread.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                         } while (0)
>
> static inline uint64_t rdtscp(void)
> {
> 	uint64_t rax, rdx;
> 	uint32_t aux;
>
> 	asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory");
> 	/* Combine the two 32-bit halves of the timestamp counter. */
> 	return (rdx << 32) | rax;
> }
>
> int main(int argc, char *argv[])
> {
> 	long uffd;          /* userfaultfd file descriptor */
> 	char *addr;         /* Start of region handled by userfaultfd */
> 	unsigned long len;  /* Length of region handled by userfaultfd */
> 	pthread_t thr;      /* ID of thread that handles page faults */
> 	bool young, dirty, write;
> 	struct uffdio_api uffdio_api;
> 	struct uffdio_register uffdio_register;
> 	int l;
> 	static char *page = NULL;
> 	struct uffdio_copy uffdio_copy;
> 	ssize_t nread;
> 	int page_size;
>
> 	if (argc != 5) {
> 		fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]);
> 		exit(EXIT_FAILURE);
> 	}
>
> 	page_size = sysconf(_SC_PAGE_SIZE);
> 	len = strtoul(argv[1], NULL, 0) * page_size;
> 	write = !!strtoul(argv[2], NULL, 0);
> 	young = !!strtoul(argv[3], NULL, 0);
> 	dirty = !!strtoul(argv[4], NULL, 0);
>
> 	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
> 		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> 	if (page == MAP_FAILED)
> 		errExit("mmap");
>
> 	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> 	if (uffd == -1)
> 		errExit("userfaultfd");
>
> 	uffdio_api.api = UFFD_API;
> 	uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS;
> 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
> 		errExit("ioctl-UFFDIO_API");
>
> 	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
> 		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (addr == MAP_FAILED)
> 		errExit("mmap");
>
> 	uffdio_register.range.start = (unsigned long) addr;
> 	uffdio_register.range.len = len;
> 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
> 		errExit("ioctl-UFFDIO_REGISTER");
>
> 	uffdio_copy.src = (unsigned long) page;
> 	uffdio_copy.mode = 0;
> 	if (young)
> 		uffdio_copy.mode |= (1ul << 2);
> 	if (dirty)
> 		uffdio_copy.mode |= (1ul << 3);
>
> 	uffdio_copy.len = page_size;
> 	uffdio_copy.copy = 0;
>
> 	for (l = 0; l < len; l += page_size) {
> 		uffdio_copy.dst = (unsigned long)(&addr[l]);
> 		if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
> 			errExit("ioctl-UFFDIO_COPY");
> 	}
>
> 	for (l = 0; l < len; l += page_size) {
> 		char c;
> 		uint64_t start;
>
> 		start = rdtscp();
> 		if (write)
> 			addr[l] = 5;
> 		else
> 			c = *(volatile char *)(&addr[l]);
> 		printf("%ld\n", rdtscp() - start);
> 	}
>
> 	exit(EXIT_SUCCESS);
> }
>

-- 
Peter Xu