From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f198.google.com (mail-qt0-f198.google.com [209.85.216.198]) by kanga.kvack.org (Postfix) with ESMTP id 62C9D6B0253 for ; Thu, 12 Oct 2017 16:28:41 -0400 (EDT) Received: by mail-qt0-f198.google.com with SMTP id k31so12603115qta.22 for ; Thu, 12 Oct 2017 13:28:41 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id 128si2488805qkd.368.2017.10.12.13.28.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 12 Oct 2017 13:28:39 -0700 (PDT) Subject: Re: [PATCH v8 9/9] sparc64: Add support for ADI (Application Data Integrity) Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Content-Type: text/plain; charset=us-ascii From: Anthony Yznaga In-Reply-To: <5edaf7dc-6bc7-c365-0b54-b78975c08894@oracle.com> Date: Thu, 12 Oct 2017 13:27:25 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <782BD060-74C5-4D9B-B013-731249A72F87@oracle.com> References: <9e3a8c90ade57d94d1ab2100c6d9508fc2d0a212.1506089472.git.khalid.aziz@oracle.com> <5edaf7dc-6bc7-c365-0b54-b78975c08894@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Khalid Aziz Cc: David Miller , dave.hansen@linux.intel.com, corbet@lwn.net, Bob Picco , STEVEN_SISTARE , Pasha Tatashin , Mike Kravetz , Rob Gardner , mingo@kernel.org, Nitin Gupta , kirill.shutemov@linux.intel.com, Tom Hromatka , Eric Saint Etienne , Allen Pais , cmetcalf@mellanox.com, akpm@linux-foundation.org, geert@linux-m68k.org, pmladek@suse.com, tklauser@distanz.ch, Atish Patra , Shannon Nelson , Vijay Kumar , peterz@infradead.org, mhocko@suse.com, jack@suse.cz, lstoakes@gmail.com, punit.agrawal@arm.com, hughd@google.com, thomas.tai@oracle.com, paul.gortmaker@windriver.com, ross.zwisler@linux.intel.com, dave.jiang@intel.com, willy@infradead.org, ying.huang@intel.com, zhongjiang@huawei.com, minchan@kernel.org, imbrenda@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com, aarcange@redhat.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, sparclinux@vger.kernel.org, linux-mm@kvack.org, Khalid Aziz > On Oct 12, 2017, at 7:44 AM, Khalid Aziz = wrote: >=20 > Hi Anthony, >=20 > Please quote only the relevant parts of the patch with comments. That = makes it much easier to find the comments. Okay. >=20 > On 10/06/2017 04:12 PM, Anthony Yznaga wrote: >>> On Sep 25, 2017, at 9:49 AM, Khalid Aziz = wrote: >>>=20 >>> This patch extends mprotect to enable ADI (TSTATE.mcde), = enable/disable >>> MCD (Memory Corruption Detection) on selected memory ranges, enable >>> TTE.mcd in PTEs, return ADI parameters to userspace and save/restore = ADI >>> version tags on page swap out/in or migration. ADI is not enabled by >> I still don't believe migration is properly supported. Your >> implementation is relying on a fault happening on a page while its >> migration is in progress so that do_swap_page() will be called, but >> I don't see how do_swap_page() will be called if a fault does not >> happen until after the migration has completed. >=20 > User pages are on LRU list and for the mapped pages on LRU list, = migrate_pages() ultimately calls try_to_unmap_one and makes a migration = swap entry for the page being migrated. This forces a page fault upon = access on the destination node and the page is swapped back in from swap = cache. The fault is forced by the migration swap entry, rather than = fault being an accidental event. If page fault happens on the = destination node while migration is in progress, do_swap_page() waits = until migration is done. Please take a look at the code in = __unmap_and_move(). I looked at the code again, and I now believe ADI tags are never = restored for migrated pages. Here's why: A successful call to try_to_unmap() by __unmap_and_move() will have = unmapped the page, replaced the pte with a migration pte, and saved the = ADI tags. If an access to the unmapped VA range is attempted while the migration = pte is in place, handle_pte_fault() will call do_swap_page() because the = page present flag is not set in the pte. do_swap_page() will see that = the pte is a migration pte and call migration_entry_wait() where it will = block until the migration pte is removed. do_swap_page() will then = return so that the fault is retried. remove_migration_pte() replaces the migration pte with a regular pte. = The regular pte will have the page present flag set. Whether due to a = retry or not, the next fault on the VA range will therefore not call = do_swap_page() and the tags will not be restored. >=20 >=20 >>> +#define finish_arch_post_lock_switch = finish_arch_post_lock_switch >>> +static inline void finish_arch_post_lock_switch(void) >>> +{ >>> + /* Restore the state of MCDPER register for the new process >>> + * just switched to. >>> + */ >>> + if (adi_capable()) { >>> + register unsigned long tmp_mcdper; >>> + >>> + tmp_mcdper =3D test_thread_flag(TIF_MCDPER); >>> + __asm__ __volatile__( >>> + "mov %0, %%g1\n\t" >>> + ".word 0x9d800001\n\t" /* wr %g0, %g1, %mcdper" = */ >>> + ".word 0xaf902001\n\t" /* wrpr %g0, 1, %pmcdper = */ >>> + : >>> + : "ir" (tmp_mcdper) >>> + : "g1"); >>> + if (current && current->mm && current->mm->context.adi) = { >>> + struct pt_regs *regs; >>> + >>> + regs =3D task_pt_regs(current); >>> + regs->tstate |=3D TSTATE_MCDE; >> This works, but it costs additional cycles on every context switch to >> keep setting TSTATE_MCDE. PSTATE.mcde=3D1 only affects loads and = stores >> to memory mapped with TTE.mcd=3D1 so there is no impact if it is set = and >> no memory is mapped with TTE.mcd=3D1. That is why I suggested just >> setting TSTATE_MCDE once when a process thread is initialized. >=20 > This change was suggested by David Miller. I believe there is merit to = that suggestion. I'm not saying it's without merit. I just wanted to point out that the = solution adds a bit of additional work to every context switch and that = it's possible to avoid it. I'm fine if David still prefers his = solution. >=20 >>> + /* Tag storage has not been allocated for this vma and space >>> + * is available in tag storage descriptor. Since this page is >>> + * being swapped out, there is high probability subsequent pages >>> + * in the VMA will be swapped out as well. Allocate pages to >>> + * store tags for as many pages in this vma as possible but not >>> + * more than TAG_STORAGE_PAGES. Each byte in tag space holds >>> + * two ADI tags since each ADI tag is 4 bits. Each ADI tag >>> + * covers adi_blksize() worth of addresses. Check if the hole is >>> + * big enough to accommodate full address range for using >>> + * TAG_STORAGE_PAGES number of tag pages. >>> + */ >>> + size =3D TAG_STORAGE_PAGES * PAGE_SIZE; >>> + end_addr =3D addr + (size*2*adi_blksize()) - 1; >>> + /* Check for overflow. If overflow occurs, allocate only one = page */ >>> + if (end_addr < addr) { >>> + size =3D PAGE_SIZE; >>> + end_addr =3D addr + (size*2*adi_blksize()) - 1; >> end_addr could still overflow even with size =3D PAGE_SIZE. >> Maybe you could just set end_addr to (unsigned long)-1 and = recalculate >> the size based on that. >=20 > I agree at theoretical level. The number of VA bits is already limited = by the max implemented VA bit in hardware plus with ADI in use, top 4 = bits are not available as well either, so there is lot of unused room at = the upper end of VA and end_addr is not going to roll over. = Nevertheless, I can fix this as well for completeness sake. The MMU ignores the ADI tag bits and sign extends from the actual most = significant VA bit to get the actual VA so the hardware is capable of = mapping a user page at 0xffffffffffffe000. However, Linux imposes an = upper limit on the maximum user VA. There are theses comments in = arch/sparc/include/asm/processor_64.h: /* * User lives in his very own context, and cannot reference us. Note * that TASK_SIZE is a misnomer, it really gives maximum user virtual * address that the kernel will allocate out. * * XXX No longer using virtual page tables, kill this upper limit... */ Maybe this limit will remain forever, maybe not. >=20 >>> + } >>> + if (hole_end < end_addr) { >>> + /* Available hole is too small on the upper end of >>> + * address. Can we expand the range towards the lower >>> + * address and maximize use of this slot? >>> + */ >>> + unsigned long tmp_addr; >>> + >>> + end_addr =3D hole_end - 1; >>> + tmp_addr =3D end_addr - (size*2*adi_blksize()) + 1; >>> + /* Check for underflow. If underflow occurs, allocate >>> + * only one page for storing ADI tags >>> + */ >>> + if (tmp_addr > addr) { >> Should compare tmp_addr to end_addr rather than addr. >=20 > No, this is correct. If tmp_addr wraps around to the upper end, = theoretically it can be smaller than end_addr but still be bigger than = addr since addr < end_addr. The way it is written, this is a safer test. If you subtract a large enough value from end_addr to cause underflow, = the result will always be greater than end_addr. >=20 >>> + size =3D PAGE_SIZE; >>> + tmp_addr =3D addr + (size*2*adi_blksize()) - 1; >> copy/paste error? tmp_addr should be recalculated from end_addr and = a >> new size. The new size needs to be adjusted based on end_addr to as >> little as PAGE_SIZE. >=20 > Good catch. >=20 >>> diff --git a/arch/sparc/kernel/etrap_64.S = b/arch/sparc/kernel/etrap_64.S >>> index 1276ca2567ba..7be33bf45cff 100644 >>> --- a/arch/sparc/kernel/etrap_64.S >>> +++ b/arch/sparc/kernel/etrap_64.S >>> @@ -132,7 +132,33 @@ etrap_save: save %g2, -STACK_BIAS, %sp >>> stx %g6, [%sp + PTREGS_OFF + PT_V9_G6] >>> stx %g7, [%sp + PTREGS_OFF + PT_V9_G7] >>> or %l7, %l0, %l7 >>> - sethi %hi(TSTATE_TSO | TSTATE_PEF), %l0 >>> +661: sethi %hi(TSTATE_TSO | TSTATE_PEF), %l0 >>> + /* >>> + * If userspace is using ADI, it could potentially pass >>> + * a pointer with version tag embedded in it. To = maintain >>> + * the ADI security, we must enable PSTATE.mcde. = Userspace >>> + * would have already set TTE.mcd in an earlier call to >>> + * kernel and set the version tag for the address being >>> + * dereferenced. Setting PSTATE.mcde would ensure any >>> + * access to userspace data through a system call honors >>> + * ADI and does not allow a rogue app to bypass ADI by >>> + * using system calls. Setting PSTATE.mcde only affects >>> + * accesses to virtual addresses that have TTE.mcd set. >>> + * Set PMCDPER to ensure any exceptions caused by ADI >>> + * version tag mismatch are exposed before system call >>> + * returns to userspace. Setting PMCDPER affects only >>> + * writes to virtual addresses that have TTE.mcd set and >>> + * have a version tag set as well. >>> + */ >>> + .section .sun_m7_1insn_patch, "ax" >>> + .word 661b >>> + sethi %hi(TSTATE_TSO | TSTATE_PEF | TSTATE_MCDE), %l0 >> rtrap is still missing patches to turn on TSTATE_MCDE when needed. >>> + .previous >>> +661: nop >>> + .section .sun_m7_1insn_patch, "ax" >>> + .word 661b >>> + .word 0xaf902001 /* wrpr %g0, 1, %pmcdper */ >> I still disagree with setting %pmcdper=3D1 on every trap into the = kernel, >> and now %pmcdper is also being set to 1 on every context switch. It >> should be sufficient to set it once for each CPU but also setting it >> on every context switch is at least less impactful than setting it on >> every etrap. >=20 > We discussed this before and I believe not setting %pmcdper on trap = into kernel can expose kernel to the possibility of running system calls = with deferred MCD exceptions which in turn causes unreliable behavior = from userspace point of view when MCD exception happens (userspace might = get SIGSEGV, or system call terminates with error depending upon when = exception is delievered). I believe it is important for system calls to = behave consistently. I'm not saying that %pmcdper should not be set to 1. I'm saying that = it's expensive and unnecessary to keep setting it to 1 on every trap = into the kernel and now every context switch (~50-70 cycles every time = the register is updated). It should be sufficient to set it once when = each CPU is configured. If it's really necessary to mitigate the = possibility of something in the kernel clearing %pmcdper and reseting it = then setting it on every context switch should be sufficient and less = onerous than every etrap. Anthony >=20 >=20 >>> diff --git a/arch/sparc/kernel/setup_64.c = b/arch/sparc/kernel/setup_64.c >>> index 150ee7d4b059..98a877715832 100644 >>> --- a/arch/sparc/kernel/setup_64.c >>> +++ b/arch/sparc/kernel/setup_64.c >>> @@ -293,6 +293,8 @@ static void __init sun4v_patch(void) >>> case SUN4V_CHIP_SPARC_M7: >>> case SUN4V_CHIP_SPARC_M8: >>> case SUN4V_CHIP_SPARC_SN: >>> + sun4v_patch_1insn_range(&__sun_m7_1insn_patch, >>> + &__sun_m7_1insn_patch_end); >>> sun_m7_patch_2insn_range(&__sun_m7_2insn_patch, >>> &__sun_m7_2insn_patch_end); >> Why did you keep sun_m7_patch_2insn_range() and not replace it with >> sun4v_m7_patch_2insn_range()? >=20 > sun_m7_patch_2insn_range() is already in the kernel and not part of = this patch. It can be replaced but that should be a separate patch in my = opinion. >=20 > Thanks, > Khalid > -- > To unsubscribe from this list: send the line "unsubscribe sparclinux" = in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org