From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A5E8EC10F25 for ; Sat, 7 Mar 2020 21:47:54 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 2DE1320684 for ; Sat, 7 Mar 2020 21:47:54 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="fkh2QGMj" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2DE1320684 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id BA5FE6B0005; Sat, 7 Mar 2020 16:47:53 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B56D86B0006; Sat, 7 Mar 2020 16:47:53 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A6C246B0007; Sat, 7 Mar 2020 16:47:53 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0064.hostedemail.com [216.40.44.64]) by kanga.kvack.org (Postfix) with ESMTP id 90F976B0005 for ; Sat, 7 Mar 2020 16:47:53 -0500 (EST) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 60177180AD806 for ; Sat, 7 Mar 2020 21:47:53 +0000 (UTC) X-FDA: 76569904026.09.brake94_61e7f5177b31f X-HE-Tag: brake94_61e7f5177b31f X-Filterd-Recvd-Size: 13030 Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [207.211.31.120]) by imf26.hostedemail.com (Postfix) with ESMTP for ; Sat, 7 Mar 2020 21:47:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1583617672; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=r74SEp8ILoqShzcrneCL8Al2mAGKKUqCrNfiahijaJ4=; b=fkh2QGMjgPC2FeCsgS17wygpmruSvM2AZDTwqBB+eW8IIztSXleOLhqaNiyM7geTcT8uQz f7hsvMpWeUg6eBWyhl0e9ZU3SuEgn55AT3u9GyA1cfnjz0LiMMAfG+RmjaQxdFlmz/qUh0 9nWQHUWetnPbtoARvjJ3/LdtrbCX+Ks= Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com [209.85.222.197]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-71-YkSBhVbDM8-D5T0-s58wgQ-1; Sat, 07 Mar 2020 16:47:47 -0500 X-MC-Unique: YkSBhVbDM8-D5T0-s58wgQ-1 Received: by mail-qk1-f197.google.com with SMTP id 22so4400519qkc.7 for ; Sat, 07 Mar 2020 13:47:47 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=IkOBexO04RVBl2l8ypAUnEz4Z2T/51GfLf5xq5+aDvY=; b=KusKvHD47pS+NamiN6Dywwlkxm+MBW1RJGRGQsk29lPJGjOYD62hwRkYoiDO1+M8k3 oBsZe0SQT5hf5wJRY/mP/TD9eG9UHfG7nG80fLjM7aExsbkEQ6YTOL13JcpTFw+Wb+M3 da9VwlIUaGxxibrh+IjZF2E3gPp0z+/3dR8DrgFynS1adYTm7V+5+sMNa3ITNBm9lk9+ lbOgQ+wTigeyGHOQ+PT1YbLkFRfB316cwlfgcdCEBfMv+IaMSP95D5T+BgzyerpF1dnZ 6jr1tnev7/Ed3TcJBW+kdZ2doLnLbzcJSWE/7B1utRAFPm0Eg/EVL+y8NWxu6Mjk7+lW zIFQ== X-Gm-Message-State: ANhLgQ2/T+jkk7tQ29SjZvVFzGo9d0ARRsGDLl17/viPHmmQsgtrGrXj 6FxwcE6FNvt+ebpo+/Nsf2C9Zs8r+jJc4SWZGz72JsVfuWgqXod3J45bp4PTUKeu6jIAArEthnu Xpb34vFw8BPU= X-Received: by 2002:a37:606:: with SMTP id 6mr9063571qkg.194.1583617667257; Sat, 07 Mar 2020 13:47:47 -0800 (PST) X-Google-Smtp-Source: ADFU+vuFg3WiUjDadRTm8H6xACsKd8IRm59QlCeklYibzU4D4eDdDdU2T9UUIl41chl/6P+PgPuIqQ== X-Received: by 2002:a37:606:: with SMTP id 6mr9063548qkg.194.1583617666811; Sat, 07 Mar 2020 13:47:46 -0800 (PST) Received: from xz-x1 ([2607:9880:19c0:32::2]) by smtp.gmail.com with ESMTPSA id h21sm110044qtu.58.2020.03.07.13.47.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Mar 2020 13:47:45 -0800 (PST) Date: Sat, 7 Mar 2020 16:47:43 -0500 From: Peter Xu To: David Hildenbrand Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , Martin Cracauer , Linus Torvalds , Mike Rapoport , "Kirill A . Shutemov" , Johannes Weiner , "Dr . David Alan Gilbert" , Bobby Powers , Maya Gokhale , Jerome Glisse , Mike Kravetz , Matthew Wilcox , Marty McFadden , Mel Gorman , Hugh Dickins , Brian Geffon , Denis Plotnikov , Pavel Emelyanov Subject: Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements Message-ID: <20200307214743.GA4206@xz-x1> References: <20200220155353.8676-1-peterx@redhat.com> <1eb7bdd4-348f-da87-47a1-0b022b70e918@redhat.com> MIME-Version: 1.0 In-Reply-To: <1eb7bdd4-348f-da87-47a1-0b022b70e918@redhat.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Sat, Mar 07, 2020 at 09:33:08PM +0100, David Hildenbrand wrote: > On 20.02.20 16:53, Peter Xu wrote: > > [Resend v6] > >=20 > > This is v6 of the series. It is majorly a rebase to 5.6-rc2, nothing > > else to be expected (plus some tests after the rebase). Instead of > > rewrite the cover letter I decided to use what we have for v5. > >=20 > > Adding extra CCs for both Bobby Powers and > > Brian Geffon . > >=20 > > Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry > >=20 > > Any review comment is appreciated. Thanks, >=20 > If I am not completely missing something (and all my testing today was > wrong) there is a very simple reason why I *LOVE* this series and it made > my weekend. It makes userfaultfd with concurrent discarding (e.g., > MADV_DONTNEED) of pages actually usable. Hi, David, Thanks for doing further test against this branch! >=20 > The issue in current code is that between placing a page and waking > up a waiter, somebody can zap the new placed page and trigger > re-fault, triggering a SIGBUS and crashing an application where all > memory is supposed to be accessible. And there is no real way to protect > from that, because when the fault handler will be woken up and retry > is not deterministic (e.g., making madvise(MADV_DONTNEED) and > UFFDIO_ZEROPAGE mutually exclusive does not help). >=20 > Find a simple reproducer at the end of this mail. >=20 > Before this series: > [root@localhost ~]# ./a.out=20 > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > [ 34.849604] FAULT_FLAG_ALLOW_RETRY missing 70 > [ 34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92 > [ 34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO= S rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 > [ 34.852818] Call Trace: > [ 34.853045] dump_stack+0x8f/0xd0 > [ 34.853338] handle_userfault.cold+0x1a/0x2e > [ 34.853704] ? find_held_lock+0x2b/0x80 > [ 34.854031] ? __handle_mm_fault+0x18c5/0x1900 > [ 34.854409] __handle_mm_fault+0x18d4/0x1900 > [ 34.854784] handle_mm_fault+0x169/0x360 > [ 34.855120] do_user_addr_fault+0x20d/0x490 > [ 34.855478] async_page_fault+0x43/0x50 > [ 34.855809] RIP: 0033:0x401659 > [ 34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad f= a ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8 > [ 34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246 > [ 34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba= 85784ef > [ 34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba= 864b320 > [ 34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 000000000= 0000070 > [ 34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000= 0401120 > [ 34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 000000000= 0000000 >=20 > After this series: > Well, "Progress!" all day long. Yes, IIUC the race can happen like this in your below test: main thread uffd thread disgard thread =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D access page uffd page fault wait for page UFFDIO_ZEROCOPY put a page P there MADV_DONTNEED on P wakeup main thread return from fault page still missing uffd page fault again (without ALLOW_RETRY) --> SIGBUS. And I agree this should be one if the major issues that this series wants to address. >=20 >=20 > Can we please have a way to identify that this "feature" is available? > I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from > user space easily and use concurrent discards without crashing our applic= ations. I'm not sure how others think about it, but to me this still fells into the bucket of "solving an existing problem" rather than a feature. Also note that this should change the behavior for the page fault logic in general, rather than an uffd-only change. So I'm also not sure whether UFFD_FEAT_* suites here even if we want it. >=20 >=20 > Questions: > 1. I assume KVM will do multiple retries as well, and have the same behav= ior, right? Yes I believe so, or we'll need to fix it. >=20 > 2. What will happen if I don't place a page on a pagefault, but only do a= UFFDIO_WAKE? > For now we were able to trigger a signal this way. If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus even without the help of the MADV_DONTNEED race. > If the behavior is changed, can > we make this configurable via a UFFD_FEAT? I'll still think that could be an overkill, but I'll leave the discussion to the experts. Thanks, >=20 > --- snip --- > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include >=20 > static int page_size; >=20 > static void *fault_handler_thread(void *arg) > { > const long uffd =3D (long) arg; > struct pollfd pollfd =3D { > .fd =3D uffd, > .events =3D POLLIN, > }; > int ret; >=20 > while (true) { > struct uffdio_zeropage zeropage =3D {}; > struct uffd_msg msg; > ssize_t nread; >=20 > if (poll(&pollfd, 1, -1) =3D=3D -1) { > fprintf(stderr, "POLL failed: %s\n", strerror(errno)); > exit(-1); > } > if (read(uffd, &msg, sizeof(msg)) !=3D sizeof(msg)) { > fprintf(stderr, "READ failed\n"); > exit(-1); > } > if (msg.event !=3D UFFD_EVENT_PAGEFAULT) { > fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n"); > exit(-1); > } >=20 > zeropage.range.start =3D msg.arg.pagefault.address; > zeropage.range.len =3D page_size; > do { > ret =3D ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage); > if (ret && errno !=3D EAGAIN) { > fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(e= rrno)); > exit(-1); > } > } while (ret); > } > } > static void *discard_thread(void *arg) > { > while (true) { > if (madvise(arg, page_size, MADV_DONTNEED)) { > fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno))= ; > exit(-1); > } > usleep(1000); > } > } >=20 > int main(void) > { > struct uffdio_register reg; > struct uffdio_api api =3D { > .api =3D UFFD_API, > }; > pthread_t fault, discard; > long uffd; > char *area; >=20 > page_size =3D sysconf(_SC_PAGE_SIZE); >=20 > uffd =3D syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); > if (uffd =3D=3D -1) { > fprintf(stderr, "Could not create uffd: %s\n", strerror(errno)); > exit(-1); > } > if (ioctl(uffd, UFFDIO_API, &api) =3D=3D -1) { > fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno)); > exit(-1); > } >=20 > area =3D mmap(NULL, page_size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > if (area =3D=3D MAP_FAILED) { > fprintf(stderr, "Could not allocate memory"); > exit(-1); > } >=20 > reg.range.start =3D (uint64_t) area; > reg.range.len =3D page_size, > reg.mode =3D UFFDIO_REGISTER_MODE_MISSING; > if (ioctl(uffd, UFFDIO_REGISTER, ®) =3D=3D -1) { > fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno)); > exit(-1); > } >=20 > /* thread to provide zeropages */ > if (pthread_create(&fault, NULL, fault_handler_thread, > (void *) uffd)) { > fprintf(stderr, "Could not create fault handling thread"); > exit(-1); > } >=20 > /* thread to discard the page */ > if (pthread_create(&discard, NULL, discard_thread, > (void *) area)) { > fprintf(stderr, "Could not create discard thread"); > exit(-1); > } >=20 > /* keep reading/writing the page */ > while (true) { > area[7] =3D area[1]; > usleep(10000); > printf("Progress!\n"); > } > return 0; > } >=20 >=20 > --=20 > Thanks, >=20 > David / dhildenb >=20 --=20 Peter Xu