From: Vikram Sethi
Subject: Memory failure handling of VFIO-pinned THP
Date: Thu, 23 Jan 2020 15:39:33 -0600
Message-ID: <902d2541-3da6-8519-3e94-d435afb5e19c@nvidia.com>
X-Delivered-To: linux-mm@kvack.org

Hello,

I was looking at memory_failure handling of pinned transparent hugepages (specifically pinned by VFIO for a VM with physical I/O).

AFAICT, on the initial "memory error detected" interrupt, memory_failure won't be able to split the THP because it is pinned, and will return -EBUSY without unmapping the THP from any of the processes that map it, even though it contains an uncorrected memory error.
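
To make the early exit concrete, here is a minimal toy model of the flow described above. It is not kernel code: `struct toy_thp`, `toy_split_thp` and `toy_memory_failure` are hypothetical names invented for illustration, and the ordering of marking versus splitting is simplified. The only behaviour it is meant to show is that a failed split returns -EBUSY before any unmapping happens.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a transparent hugepage; NOT the kernel's struct page. */
struct toy_thp {
    int  extra_pins;  /* long-term pins, e.g. taken on behalf of VFIO */
    int  mappers;     /* processes that still map the page */
    bool split;       /* true once broken up into base pages */
    bool poisoned;    /* toy analogue of the hwpoison marker */
};

/* In this model a split only succeeds when nobody holds an extra pin. */
static int toy_split_thp(struct toy_thp *p)
{
    if (p->extra_pins > 0)
        return -EBUSY;
    p->split = true;
    return 0;
}

/* Model of the described flow: bail out before unmapping if the split fails. */
static int toy_memory_failure(struct toy_thp *p)
{
    assert(p != NULL);
    if (!p->split && toy_split_thp(p) != 0)
        return -EBUSY;   /* early exit: mappers are left untouched */
    p->poisoned = true;
    p->mappers = 0;      /* unmap the page from every process */
    return 0;
}
```

With `extra_pins = 1` the toy returns -EBUSY and leaves `mappers` unchanged, which is the situation the VM later runs into when it consumes the poison.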

Later, when the VM does a load from the bad location (consumes poison), the firmware-first path on ARM64 has firmware forward the SEA exception to the host kernel, where the GHES code queues work for memory_failure. There memory_failure again exits early for the pinned THP, so userspace never gets the SIGBUS with the Action Required code that it needs in order to inject the error into the VM.
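
For reference, this is roughly what the userspace side would be looking for. On Linux the "Action Required" case is reported as a SIGBUS whose `si_code` is `BUS_MCEERR_AR` (with `BUS_MCEERR_AO` for "Action Optional"). The sketch below only classifies a `siginfo_t`; the `classify_sigbus` helper and the `mce_kind` enum are hypothetical names, not taken from any VMM.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Fallbacks for older headers; the values match the Linux UAPI definitions. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

static_assert(BUS_MCEERR_AR != BUS_MCEERR_AO, "AR and AO codes must differ");

/* Hypothetical classification used only in this sketch. */
enum mce_kind { MCE_NONE, MCE_ACTION_REQUIRED, MCE_ACTION_OPTIONAL };

/* Decide whether a delivered signal is a hardware memory error report. */
static enum mce_kind classify_sigbus(const siginfo_t *si)
{
    if (si == NULL || si->si_signo != SIGBUS)
        return MCE_NONE;
    switch (si->si_code) {
    case BUS_MCEERR_AR:
        return MCE_ACTION_REQUIRED;  /* poison was consumed; must act now */
    case BUS_MCEERR_AO:
        return MCE_ACTION_OPTIONAL;  /* error detected but not yet consumed */
    default:
        return MCE_NONE;             /* ordinary SIGBUS, e.g. BUS_ADRALN */
    }
}
```

A handler installed with `SA_SIGINFO` could pass its `siginfo_t *` to such a helper and use `si_addr` to work out which guest address to report. In the scenario above that signal is never sent, so this path would not run.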

Discussing with James, we were wondering why the pinned THP isn't treated like a hugetlbfs memory failure: marking the entire hugepage with the hw_poison flag and unmapping it from the processes that map it when the error is detected (memory_failure_hugetlb calling hwpoison_user_mappings)? If that were done, when the VM later tries to load the bad location, the resulting VM fault would get the appropriate VM_FAULT_HWPOISON code, which would trigger KVM to send the SIGBUS with Action Required code to userspace, which could then inject the error into the VM.
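
The proposed sequence can be sketched with another toy model. Again, none of this is kernel code: every `toy_` name is hypothetical, and the model only captures the idea of poisoning and unmapping the whole hugepage while leaving the pin in place, so that a later access faults and is reported as Action Required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a hwpoison fault result and the resulting signal. */
enum toy_fault  { TOY_FAULT_OK, TOY_FAULT_HWPOISON };
enum toy_signal { TOY_SIG_NONE, TOY_SIGBUS_ACTION_REQUIRED };

struct toy_hugepage {
    int  pins;      /* long-term pins held for device DMA */
    bool mapped;    /* still mapped into the VM's address space */
    bool poisoned;  /* whole hugepage marked bad */
};

/* Proposed handling: poison the whole hugepage and unmap it, pinned or not. */
static void toy_poison_whole_hugepage(struct toy_hugepage *hp)
{
    assert(hp != NULL);
    hp->poisoned = true;
    hp->mapped = false;  /* pins are left alone, so DMA mappings stay valid */
}

/* A later guest load faults because the mapping is gone. */
static enum toy_fault toy_handle_fault(const struct toy_hugepage *hp)
{
    assert(hp != NULL);
    return (!hp->mapped && hp->poisoned) ? TOY_FAULT_HWPOISON : TOY_FAULT_OK;
}

/* A hwpoison fault is what gets turned into the Action Required signal. */
static enum toy_signal toy_signal_for_fault(enum toy_fault f)
{
    return f == TOY_FAULT_HWPOISON ? TOY_SIGBUS_ACTION_REQUIRED : TOY_SIG_NONE;
}
```

The design point the toy highlights is that the pin count is never touched: the page stays allocated for DMA, and only the CPU-side mapping is removed.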

I do understand that the page is pinned so that DMAs can happen from the VM's I/O devices without I/O faults, but since the hw_poison flag would be set for the page on the initial "error detected" interrupt by memory_failure, the kernel wouldn't reallocate the page anyway. And any interim DMA writes that hit the bad page wouldn't be corrupting anyone else, and DMA reads would be getting poison back/completer abort.

Am I missing something, or is this currently broken for VFIO and VM THP pages with memory failure (at least as far as signaling user space goes)?

Thanks,

Vikram