From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6FA33C38A02 for ; Fri, 28 Oct 2022 23:57:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 79A318E0001; Fri, 28 Oct 2022 19:57:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 749E56B0073; Fri, 28 Oct 2022 19:57:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6114F8E0001; Fri, 28 Oct 2022 19:57:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 526596B0072 for ; Fri, 28 Oct 2022 19:57:38 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 223EA120C19 for ; Fri, 28 Oct 2022 23:57:38 +0000 (UTC) X-FDA: 80072022996.30.538BA43 Received: from mail-pj1-f53.google.com (mail-pj1-f53.google.com [209.85.216.53]) by imf24.hostedemail.com (Postfix) with ESMTP id B69DB180010 for ; Fri, 28 Oct 2022 23:57:37 +0000 (UTC) Received: by mail-pj1-f53.google.com with SMTP id b11so5937004pjp.2 for ; Fri, 28 Oct 2022 16:57:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=dL8ynURpMtdh0kmGkVrhnVStQaki4uWpAqyDuPgG8YE=; b=GzQMUs3Z324t8mnJ8BGeMc8+zTFpD3QpLW3anlsOhMUQ7HNardszqaYMyDcit6EYX7 j8xi/2FQGHyrRk9CaRVY8UzNF+EF8z9xj2D3EBhhAA+HJSMpML/eT9tuFkgUJtT1JSjc HysWAMRNtkuoliLgGaXccxmOw+y29VixR/MJjwyhQrYQEVCH1Fl0BfwtCyo3THeVinbk GRykLUgZShx6y+D2Bq/ANuhl0AvymD0RUel04UfyCTO9WxLsLbRPRzod4FXHO8sUHljD oUGNFW2Xs6wFpvvtT/O7Zn01KbfgjkIN7sA5OFCFEtoPL3/R/CGav4CWTJ9Y7oHVtKHW K86A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=dL8ynURpMtdh0kmGkVrhnVStQaki4uWpAqyDuPgG8YE=; b=CK+WeHk529QzwE6jdwHPZJ7WHLLZCZOKDedMGMJJs6EV5X3y/TFIVyPeCVFLxH5or+ G9aQucobJpUSEhUPG7SRgnfbW941/xnDAsgfmt8XIUobaU27oGaMQNaJ+Y0QW/JDsmrQ BToaIFhBmigUWnov3rYl2cQ7Dvo/WDAiQtUwQFdqIdKUYdk+jgQzWUtivnfjew+aqTH8 aSP8sJS/6tWbpzG6ctUmtahBEY7touUoMxNFoX3ny23MDYyinzbYtW6ZYjY29Mq6qs8d yS/4/lvgXCBiB6Cy/xa6sR+AwtNVZo/WT/VYSOKqonrdL4fWUunXUPVI/r2lZ47GU85I NXTQ== X-Gm-Message-State: ACrzQf3RhuEGeo9CjH4fONKqdA6XsY/7MY8Ew27K0fWdvtbYAipeUUTH nxt6RHT0E44mFnHTAslFAkQ= X-Google-Smtp-Source: AMsMyM4dq2JCZiaeX1ZMkDakggfXQIU8/QR5KwQl779jicrVW2qJFszu81cT4MPos2VSSbWZaesFlQ== X-Received: by 2002:a17:90a:2b42:b0:213:82b3:212b with SMTP id y2-20020a17090a2b4200b0021382b3212bmr1931225pjc.106.1667001456289; Fri, 28 Oct 2022 16:57:36 -0700 (PDT) Received: from smtpclient.apple ([66.170.99.95]) by smtp.gmail.com with ESMTPSA id a19-20020aa79713000000b0056bc1a41209sm36234pfg.33.2022.10.28.16.57.34 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 28 Oct 2022 16:57:35 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.120.41.1.1\)) Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment From: Nadav Amit In-Reply-To: Date: Fri, 28 Oct 2022 16:57:32 -0700 Cc: Jann Horn , John Hubbard , X86 ML , Matthew Wilcox , Andrew Morton , kernel list , Linux-MM , Andrea Arcangeli , "Kirill A . Shutemov" , jroedel@suse.de, ubizjak@gmail.com, Linus Torvalds Content-Transfer-Encoding: quoted-printable Message-Id: References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org> <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <6C548A9A-3AF3-4EC1-B1E5-47A7FFBEB761@gmail.com> To: Peter Zijlstra X-Mailer: Apple Mail (2.3696.120.41.1.1) ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1667001457; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=dL8ynURpMtdh0kmGkVrhnVStQaki4uWpAqyDuPgG8YE=; b=Szk88mGF2AB6VifPozioRYRZCY/k3j7K1nQ6THCPSWySbO7Po74nCh/4N1OlwYswUC9LUf K+gWhcIZ0skhYfecNVNWhaozLlqMLVp9tErhOcQLrTWP3Tmx7IfDt58wtfKb7NpXUE642d 0atDjy/rrqNQFZv148sTr2zSOES1Qho= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=GzQMUs3Z; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf24.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.216.53 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1667001457; a=rsa-sha256; cv=none; b=BMm9M+9xB1eJGzmLKxsfX6btdvhIsj5+RTbNwpEe/AlkNWFTFDDKFecU4JZ4+Z7sQLFr2+ w+ZWXC4S58+RRSSfbeMzvbkVHJ8QSOu6gfRuoUg+cHd0Fu79X5g8QtHwzwt+wmcqoQcHU7 EssTMkYendZOkwW6940B67mNyfHXRxY= X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: B69DB180010 X-Rspam-User: Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=GzQMUs3Z; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf24.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.216.53 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com X-Stat-Signature: iy7z485tpkcn6kwx85q5bu35rnqk1u6r X-HE-Tag: 1667001457-506755 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Oct 27, 2022, at 2:44 PM, Nadav Amit wrote: > Anyhow, admittedly, I need to give it more thought. For instance, in = respect > to the code that you mentioned (in zap_pte_range), after reading it = again, > it seems strange: how is ok to defer the TLB flush after the rmap was > removed, even if it is done while the PTL is held. > folio_clear_dirty_for_io() would not sync on the PTL afterwards, so = the page > might be later re-dirtied using a stale cached PTE. Oh well. Peter, So it appears to be a problem - flushing after removing the rmap. I = attach a PoC that shows the problem. The problem is in the following code of zap_pte_range(): if (!PageAnon(page)) { if (pte_dirty(ptent)) { force_flush =3D 1; set_page_dirty(page); } =E2=80=A6 } page_remove_rmap(page, vma, false); Once we remove the rmap, rmap_walk() would not acquire the page-table = lock anymore. As a result, nothing prevents the kernel from performing = writeback and cleaning the page-struct dirty-bit before the TLB is actually = flushed. As a result, if there is an additional thread that has the dirty-PTE = cached in the TLB, it can keep writing to the page and nothing = (PTE/page-struct) will keep track that the page has been dirtied. In other words, writes to the memory mapped file after munmap()/MADV_DONTNEED started can be lost. This affects both munmap() and MADV_DONTNEED. One might argue that if = you don=E2=80=99t keep writing after munmap()/MADV_DONTNEED it=E2=80=99s = your fault (questionable), but if that=E2=80=99s the case why do we bother with = force_flush at all? If we want it to behave correctly - i.e., writes after = munmap/MADV_DONTNEED to propagate to the file - we need to collect dirty pages and remove = their rmap only after we flush the TLB (and before we release the ptl). = mmu_gather would probably need to be changed for this matter. Thoughts? --- Some details about the PoC (below): We got 3 threads that use 2MB range: 1. Maintains a counter in the first 4KB of the 2MB range and checks it actually updates the memory. 2. Dirties pages in 2MB range (just to make the race more likely). 3. Either (i) maps the same mapping at the first 4KB (to test munmap indirectly); or (ii) runs MADV_DONTNEED on the first 4KB. In addition, a child process runs msync and fdatasync to writeback the = first 4KB. The PoC should be run with a file on a block RAM device. It manages to trigger the issue relatively reliably (within 60 seconds) with munmap() = and slightly less reliably with MADV_DONTNEED. I have no idea if it works in = VM, deep C-states, etc.. -- >8 -- #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #define handle_error(msg) \ do { perror(msg); exit(EXIT_FAILURE); } while (0) void *p; volatile bool stop =3D false; pid_t flusher_pid; int fd; #define PAGE_SIZE (4096ul) #define PAGES_PER_PMD (512) #define HPAGE_SIZE (PAGE_SIZE * PAGES_PER_PMD) // Comment MUNMAP_TEST for MADV_DONTNEED test #define MUNMAP_TEST void *dirtying_thread(void *arg) { int i; while (!stop) { for (i =3D 1; i < PAGES_PER_PMD; i++) { *(volatile char *)(p + (i * PAGE_SIZE) + 64) =3D = 5; } } return NULL; } void *checking_thread(void *arg) { volatile unsigned long *ul_p =3D (volatile unsigned long*)p; unsigned long cnt =3D 0; while (!stop) { *ul_p =3D cnt; if (*ul_p !=3D cnt) { printf("FAILED: expected %ld, got %ld\n", cnt, = *ul_p); kill(flusher_pid, SIGTERM); exit(0); } cnt++; } return NULL; } void *remap_thread(void *arg) { void *ptr; struct timespec t =3D { .tv_nsec =3D 10000, }; while (!stop) { #ifdef MUNMAP_TEST ptr =3D mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0); if (ptr =3D=3D MAP_FAILED) handle_error("remap_thread"); #else if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0) handle_error("MADV_DONTNEED"); nanosleep(&t, NULL); #endif } return NULL; } void flushing_process(void) { // Remove the pages to speed up rmap_walk and allow to drop = caches. if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0) handle_error("MADV_DONTNEED"); while (true) { if (msync(p, PAGE_SIZE, MS_SYNC)) handle_error("msync"); if (posix_fadvise(fd, 0, PAGE_SIZE, = POSIX_FADV_DONTNEED)) handle_error("posix_fadvise"); } } int main(int argc, char *argv[]) { void *(*thread_funcs[])(void*) =3D { &dirtying_thread, &checking_thread, &remap_thread, };=09 int r, i; int rc1, rc2; unsigned long addr; void *ptr; char *page =3D malloc(PAGE_SIZE); int n_threads =3D sizeof(thread_funcs) / sizeof(*thread_funcs); pthread_t *threads =3D malloc(sizeof(pthread_t) * n_threads); pid_t pid; =09 if (argc < 2) { fprintf(stderr, "usages: %s [filename]\n", argv[0]); exit(EXIT_FAILURE); } fd =3D open(argv[1], O_RDWR|O_CREAT, 0666); if (fd =3D=3D -1) handle_error("open fd"); for (i =3D 0; i < PAGES_PER_PMD; i++) { if (write(fd, page, PAGE_SIZE) !=3D PAGE_SIZE) handle_error("write"); } free(page); ptr =3D mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, = MAP_PRIVATE|MAP_ANON, -1, 0); if (ptr =3D=3D MAP_FAILED) handle_error("mmap anon"); addr =3D (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - = 1); printf("starting...\n"); ptr =3D mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0); if (ptr =3D=3D MAP_FAILED) handle_error("mmap file - start"); =09 p =3D ptr; for (i =3D 0; i < n_threads; i++) { r =3D pthread_create(&threads[i], NULL, thread_funcs[i], = NULL); if (r) handle_error("pthread_create"); } // Run the flushing process in a different process, so msync() = would // not require mmap_lock. pid =3D fork(); if (pid =3D=3D 0) flushing_process(); flusher_pid =3D pid; sleep(60); stop =3D true; for (i =3D 0; i < n_threads; i++) pthread_join(threads[i], NULL); kill(flusher_pid, SIGTERM); printf("Finished without an error"); exit(0); }=