Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
To: "Song Bao Hua (Barry Song)", Matthew Wilcox
Cc: "Wangzhou (B)", "linux-kernel@vger.kernel.org",
 "iommu@lists.linux-foundation.org", "linux-mm@kvack.org",
 "linux-arm-kernel@lists.infradead.org", "linux-api@vger.kernel.org",
 Andrew Morton, Alexander Viro, "gregkh@linuxfoundation.org",
 "jgg@ziepe.ca", "kevin.tian@intel.com", "jean-philippe@linaro.org",
 "eric.auger@redhat.com", "Liguozhu (Kenneth)", "zhangfei.gao@linaro.org",
 "chensihang (A)"
References: <1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com>
 <1612685884-19514-2-git-send-email-wangzhou1@hisilicon.com>
 <20210207213409.GL308988@casper.infradead.org>
 <20210208013056.GM308988@casper.infradead.org>
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Mon, 8 Feb 2021 11:37:24 +0100

On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
>
>
>> -----Original Message-----
>> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of
>> David Hildenbrand
>> Sent: Monday, February 8, 2021 9:22 PM
>> To: Song Bao Hua (Barry Song); Matthew Wilcox
>> Cc: Wangzhou (B); linux-kernel@vger.kernel.org;
>> iommu@lists.linux-foundation.org; linux-mm@kvack.org;
>> linux-arm-kernel@lists.infradead.org; linux-api@vger.kernel.org;
>> Andrew Morton; Alexander Viro;
>> gregkh@linuxfoundation.org; jgg@ziepe.ca; kevin.tian@intel.com;
>> jean-philippe@linaro.org; eric.auger@redhat.com; Liguozhu (Kenneth);
>> zhangfei.gao@linaro.org; chensihang (A)
>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
>> pin
>>
>> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of
>>>> Matthew Wilcox
>>>> Sent: Monday, February 8, 2021 2:31 PM
>>>> To: Song Bao Hua (Barry Song)
>>>> Cc: Wangzhou (B); linux-kernel@vger.kernel.org;
>>>> iommu@lists.linux-foundation.org; linux-mm@kvack.org;
>>>> linux-arm-kernel@lists.infradead.org; linux-api@vger.kernel.org;
>>>> Andrew Morton; Alexander Viro;
>>>> gregkh@linuxfoundation.org; jgg@ziepe.ca; kevin.tian@intel.com;
>>>> jean-philippe@linaro.org; eric.auger@redhat.com; Liguozhu (Kenneth);
>>>> zhangfei.gao@linaro.org; chensihang (A)
>>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
>>>> pin
>>>>
>>>> On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
>>>>>>> In high-performance I/O cases, accelerators might want to perform
>>>>>>> I/O on a memory without IO page faults which can result in dramatically
>>>>>>> increased latency. Current memory related APIs could not achieve this
>>>>>>> requirement, e.g. mlock can only avoid memory to swap to backup device,
>>>>>>> page migration can still trigger IO page fault.
>>>>>>
>>>>>> Well ... we have two requirements. The application wants to not take
>>>>>> page faults. The system wants to move the application to a different
>>>>>> NUMA node in order to optimise overall performance. Why should the
>>>>>> application's desires take precedence over the kernel's desires? And why
>>>>>> should it be done this way rather than by the sysadmin using numactl to
>>>>>> lock the application to a particular node?
>>>>>
>>>>> NUMA balancer is just one of many reasons for page migration. Even one
>>>>> simple alloc_pages() can cause memory migration in just single NUMA
>>>>> node or UMA system.
>>>>>
>>>>> The other reasons for page migration include but are not limited to:
>>>>> * memory move due to CMA
>>>>> * memory move due to huge pages creation
>>>>>
>>>>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
>>>>> in the whole system.
>>>>
>>>> You're dodging the question. Should the CMA allocation fail because
>>>> another application is using SVA?
>>>>
>>>> I would say no.
>>>
>>> I would say no as well.
>>>
>>> While IOMMU is enabled, CMA almost has one user only: IOMMU driver
>>> as other drivers will depend on iommu to use non-contiguous memory
>>> though they are still calling dma_alloc_coherent().
>>>
>>> In iommu driver, dma_alloc_coherent is called during initialization
>>> and there is no new allocation afterwards. So it wouldn't cause
>>> runtime impact on SVA performance. Even if there are new allocations,
>>> CMA will fall back to general alloc_pages() and iommu drivers are
>>> almost allocating small memory for command queues.
>>>
>>> So I would say general compound pages, huge pages, especially
>>> transparent huge pages, would be bigger concerns than CMA for
>>> internal page migration within one NUMA.
>>>
>>> Not like CMA, general alloc_pages() can get memory by moving
>>> pages other than those pinned.
>>>
>>> And there is no guarantee we can always bind the memory of
>>> SVA applications to single one NUMA, so NUMA balancing is
>>> still a concern.
>>>
>>> But I agree we need a way to make CMA succeed while the userspace
>>> pages are pinned. Since pin has been viral in many drivers, I
>>> assume there is a way to handle this. Otherwise, APIs like
>>> V4L2_MEMORY_USERPTR[1] will possibly make CMA fail as there
>>> is no guarantee that userspace will allocate unmovable memory
>>> and there is no guarantee the fallback path, alloc_pages(), can
>>> succeed while allocating big memory.
>>>
>>
>> Long term pinnings cannot go onto CMA-reserved memory, and there is
>> similar work to also fix ZONE_MOVABLE in that regard.
>>
>> https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.com
>>
>> One of the reasons I detest using long term pinning of pages where it
>> could be avoided. Take VFIO and RDMA as an example: these things
>> currently can't work without them.
>>
>> What I read here: "DMA performance will be affected severely". That does
>> not sound like a compelling argument to me for long term pinnings.
>> Please find another way to achieve the same goal without long term
>> pinnings controlled by user space - e.g., controlling when migration
>> actually happens.
>>
>> For example, CMA/alloc_contig_range()/memory unplug are corner cases
>> that happen rarely, you shouldn't have to worry about them messing with
>> your DMA performance.
>
> I agree CMA/alloc_contig_range()/memory unplug would be corner cases,
> the major cases would be THP, NUMA balancing while we could totally
> disable them but it seems insensible to do that only because there is
> a process using SVA in the system.

Can't you use huge pages in your application that uses SVA and prevent
THP/NUMA balancing from kicking in?

-- 
Thanks,

David / dhildenb
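
For illustration only, the closing suggestion (back the SVA buffer with
explicit huge pages and keep THP away from it) might look roughly like the
userspace sketch below. The helper name sva_buf_alloc() and the fallback
policy are hypothetical and not part of the mempinfd patch set. Whether this
is enough to avoid migration in a given setup is an assumption that the
thread itself leaves open.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not taken from the patch set: allocate a buffer
 * intended for device (SVA) access.
 *
 * First try an explicit hugetlb mapping (MAP_HUGETLB); THP does not manage
 * such mappings. If no huge pages are available, fall back to a normal
 * anonymous mapping and ask the kernel not to use THP on that range
 * (MADV_NOHUGEPAGE). *used_hugetlb reports which path was taken.
 */
static void *sva_buf_alloc(size_t len, int *used_hugetlb)
{
    void *p;

    *used_hugetlb = 0;

#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_hugetlb = 1;
        return p;
    }
#endif

    /* Fallback: regular pages, with THP disabled for this range. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_NOHUGEPAGE
    /* Best effort: a failure here is not fatal for the sketch. */
    (void)madvise(p, len, MADV_NOHUGEPAGE);
#endif
    return p;
}
```

A caller would pass a length that is a multiple of the huge page size, since
munmap() on a hugetlb mapping expects huge-page-aligned address and length.
NUMA placement is not handled here; binding the task or the range to one node
(for example with numactl, as mentioned earlier in the thread) would be a
separate step.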