linux-mm.kvack.org archive mirror
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	sparclinux@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	"David S. Miller" <davem@davemloft.net>, Hev <r@hev.cc>,
	Anatoly Pugachev <matorola@gmail.com>,
	Raghavendra K T <raghavendra.kt@amd.com>,
	Thorsten Leemhuis <regressions@leemhuis.info>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Juergen Gross <jgross@suse.com>
Subject: Re: [PATCH v1] sparc/mm: don't unconditionally set HW writable bit when setting PTE dirty on 64bit
Date: Tue, 31 Jan 2023 19:50:05 -0500
Message-ID: <Y9m3PaiU2+YtLIJR@x1n>
In-Reply-To: <671d9bbb-0f19-2710-00ef-47734085dddc@redhat.com>

On Tue, Jan 31, 2023 at 09:47:01AM +0100, David Hildenbrand wrote:
> On 12.12.22 14:02, David Hildenbrand wrote:
> > On sparc64, there is no HW modified bit, therefore, SW tracks via a SW
> > bit if the PTE is dirty via pte_mkdirty(). However, pte_mkdirty()
> > currently also unconditionally sets the HW writable bit, which is wrong.
> > 
> > pte_mkdirty() is not supposed to make a PTE actually writable, unless the
> > SW writable bit (pte_write()) indicates that the PTE is not
> > write-protected. Fortunately, sparc64 also defines a SW writable bit.
> > 
> > For example, this already turned into a problem in the context of
> > THP splitting as documented in commit 624a2c94f5b7 ("Partly revert "mm/thp:
> > carry over dirty bit when thp splits on pmd"") and might be an issue during
> > page migration in mm/migrate.c:remove_migration_pte() as well where we:
> > 	if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
> > 		pte = pte_mkdirty(pte);
> > 
> > But more generally, any code like:
> > 	maybe_mkwrite(pte_mkdirty(pte), vma)
> > is broken on sparc64, because it will unconditionally set the HW
> > writable bit even if the SW writable bit is not set.
> > 
> > Simple reproducer that will result in a writable PTE after ptrace
> > access, to highlight the problem and as an easy way to verify if it has
> > been fixed:
> > 
> > --------------------------------------------------------------------------
> >   #include <fcntl.h>
> >   #include <signal.h>
> >   #include <stdint.h>
> >   #include <stdio.h>
> >   #include <unistd.h>
> >   #include <string.h>
> >   #include <errno.h>
> >   #include <stdlib.h>
> >   #include <sys/mman.h>
> > 
> >   static void signal_handler(int sig)
> >   {
> >           if (sig == SIGSEGV)
> >                   printf("[PASS] SIGSEGV generated\n");
> >           else
> >                   printf("[FAIL] wrong signal generated\n");
> >           exit(0);
> >   }
> > 
> >   int main(void)
> >   {
> >           size_t pagesize = getpagesize();
> >           char data = 1;
> >           off_t offs;
> >           int mem_fd;
> >           char *map;
> >           int ret;
> > 
> >           mem_fd = open("/proc/self/mem", O_RDWR);
> >           if (mem_fd < 0) {
> >                   fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno);
> >                   return 1;
> >           }
> > 
> >           map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE|MAP_ANON, -1 ,0);
> >           if (map == MAP_FAILED) {
> >                   fprintf(stderr, "mmap() failed: %d\n", errno);
> >                   return 1;
> >           }
> > 
> >           printf("original: %x\n", *map);
> > 
> >           /* debug access */
> >           offs = lseek(mem_fd, (uintptr_t) map, SEEK_SET);
> >           ret = write(mem_fd, &data, 1);
> >           if (ret != 1) {
> >                   fprintf(stderr, "write(/proc/self/mem) failed with %d: %d\n", ret, errno);
> >                   return 1;
> >           }
> >           if (*map != data) {
> >                   fprintf(stderr, "write(/proc/self/mem) not visible\n");
> >                   return 1;
> >           }
> > 
> >           printf("ptrace: %x\n", *map);
> > 
> >           /* Install signal handler. */
> >           if (signal(SIGSEGV, signal_handler) == SIG_ERR) {
> >                   fprintf(stderr, "signal() failed\n");
> >                   return 1;
> >           }
> > 
> >           /* Ordinary access. */
> >           *map = 2;
> > 
> >           printf("access: %x\n", *map);
> > 
> >           printf("[FAIL] SIGSEGV not generated\n");
> > 
> >           return 0;
> >   }
> > --------------------------------------------------------------------------
> > 
> > Without this commit (sun4u in QEMU):
> > 	# ./reproducer
> > 	original: 0
> > 	ptrace: 1
> > 	access: 2
> > 	[FAIL] SIGSEGV not generated
> > 
> > Let's fix this by setting the HW writable bit only if both the SW dirty
> > bit and the SW writable bit are set. This matches, for example, how
> > s390x handles pte_mkwrite() and pte_mkdirty() -- except that they have
> > to clear the _PAGE_PROTECT bit.
> > 
> > We have to move pte_dirty() and pte_write() up. The code patching
> > mechanism and handling constants > 22bit is a bit special on sparc64.
> > 
> > With this commit (sun4u in QEMU):
> > 	# ./reproducer
> > 	original: 0
> > 	ptrace: 1
> > 	[PASS] SIGSEGV generated
> > 
> > This handling seems to have been in place forever.
> > 
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Hev <r@hev.cc>
> > Cc: Anatoly Pugachev <matorola@gmail.com>
> > Cc: Raghavendra K T <raghavendra.kt@amd.com>
> > Cc: Thorsten Leemhuis <regressions@leemhuis.info>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Cc: Juergen Gross <jgross@suse.com>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> 
> Ping

I agree with David that the current sparc64 implementation of pte_mkdirty()
is suspicious.

What David mentioned about page migration above is correct, and Nick
recently sent another report of it here:

https://lore.kernel.org/all/CADyTPEzsvdRC15+Z5T3oryofwRYqHmHzwqRmJKJoHB3d7Tdayw@mail.gmail.com/

If this patch is correct (which I cannot judge, as I know little about
sparc64) and can be merged, it'll be the cleanest solution, compared to
what I provided here:

https://lore.kernel.org/all/Y9bvwz4FIOQ+D8c4@x1n/

And I assume it'll also fix things like the attached reproducer, where the
write bit is wrongly applied with FOLL_FORCE, so it fixes more than just the
migration case.
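
To make the rule from the patch description concrete (HW writable is set
only if both the SW dirty bit and the SW writable bit are set), here is a
minimal user-space sketch. It is not the sparc64 code: the type toy_pte_t,
the bit values, and the helper names are all made up for illustration, and
the real implementation goes through the code patching mechanism mentioned
in the quoted description.

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only (not the sparc64 values). */
#define SW_DIRTY (1u << 0)	/* software-tracked dirty bit */
#define SW_WRITE (1u << 1)	/* software writable bit, i.e. pte_write() */
#define HW_WRITE (1u << 2)	/* bit the MMU checks on stores */

typedef uint32_t toy_pte_t;	/* simplified stand-in for pte_t */

/* Behaviour described as wrong: dirtying also grants HW write access. */
static toy_pte_t mkdirty_old(toy_pte_t pte)
{
	return pte | SW_DIRTY | HW_WRITE;
}

/* Proposed rule: HW writable only if SW dirty and SW writable are both set. */
static toy_pte_t sync_hw_write(toy_pte_t pte)
{
	if ((pte & SW_DIRTY) && (pte & SW_WRITE))
		pte |= HW_WRITE;
	return pte;
}

static toy_pte_t mkdirty_new(toy_pte_t pte)
{
	return sync_hw_write(pte | SW_DIRTY);
}

static toy_pte_t mkwrite_new(toy_pte_t pte)
{
	return sync_hw_write(pte | SW_WRITE);
}
```

With this model, dirtying a write-protected entry via mkdirty_old() leaves
HW_WRITE set, which corresponds to the reproducer not getting SIGSEGV. With
mkdirty_new() the entry stays HW write-protected until mkwrite_new() is
applied as well, in either order.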

I plan to keep posting the fix I referenced above for the breakage because
it is still the safest option so far, but that can change if someone from
sparc64 can have a look at this patch and ack it.

Thanks,

-- 
Peter Xu



Thread overview: 4+ messages
2022-12-12 13:02 David Hildenbrand
2023-01-31  8:47 ` David Hildenbrand
2023-02-01  0:50   ` Peter Xu [this message]
2023-02-16 15:36 ` David Hildenbrand
