From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8007f4fc-d2e6-7aae-7297-805326adce2a@linux.alibaba.com>
Date: Mon, 17 Oct 2022 17:09:05 +0800
Subject: Re: [RFC PATCH] mm: Introduce new MADV_NOMOVABLE behavior
To: David Hildenbrand, akpm@linux-foundation.org
Cc: arnd@arndb.de, jingshan@linux.alibaba.com, linux-mm@kvack.org,
 linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
References: <6227ba4c-9455-9652-7434-7842b2b3edcb@redhat.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
In-Reply-To: <6227ba4c-9455-9652-7434-7842b2b3edcb@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 10/17/2022 4:41 PM, David Hildenbrand wrote:
> On 17.10.22 09:32, Baolin Wang wrote:
>> When creating a virtual machine, we will use memfd_create() to get
>> a file descriptor which can be used to create share memory mappings
>> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
>> flag to allocate physical pages for the virtual machine.
>>
>> When allocating physical pages for the guest, the host can fallback to
>> allocate some CMA pages for the guest when over half of the zone's free
>> memory is in the CMA area.
>>
>> In guest os, when the application wants to do some data transaction with
>> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
>> create IOMMU mappings for the DMA pages. However, when calling
>> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
>> failed to longterm-pin sometimes.
>>
>> After some investigation, we found the pages used to do DMA mapping can
>> contain some CMA pages, and these CMA pages will cause a possible
>> failure of the longterm-pin, due to failed to migrate the CMA pages.
>> The reason of migration failure may be temporary reference count or
>> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
>> ioctl returns error, which makes the application failed to start.
>>
>> To fix this issue, this patch introduces a new madvise behavior, named
>> as MADV_NOMOVABLE, to avoid allocating CMA pages and movable pages if
>> the users want to do longterm-pin, which can remove the possible failure
>> of movable or CMA pages migration.
>
> Sorry to say, but that sounds like a hack to work around a kernel
> implementation detail (how often we retry to migrate pages).

IMO, in our case a single migration failure makes our application fail
to start, which is not a trivial problem. So mitigating migration
failures can be important in this case.

> If there are CMA/ZONE_MOVABLE issue, please fix them instead, and avoid
> leaking these details to user space.

Currently we cannot forbid the fallback to CMA allocation when there is
enough free CMA memory in the zone, right? So adding a hint to help
disable the ALLOC_CMA flag seems reasonable?

As for the CMA/ZONE_MOVABLE details, yes, they are not suitable to leak
to user space. So how about renaming the madvise behavior to
MADV_PINNABLE, which means we will longterm-pin the memory after
allocation, so no CMA/ZONE_MOVABLE pages will be allocated? Or do you
have a better idea? Thanks.

> ALSO, with MAP_POPULATE as described by you this madvise flag doesn't
> make too much sense, because it will get set after all memory already
> was allocated ...

I don't think this is a problem: we can switch to MADV_POPULATE_XXX to
preallocate the physical pages after the MADV_NOMOVABLE madvise.
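
For illustration only, the ordering described above (map without
MAP_POPULATE, apply the hint, then populate) could look roughly like the
untested sketch below. Assumptions: MADV_NOMOVABLE is the behavior
proposed in this RFC and is not defined by mainline headers, so that call
is compiled only when the macro exists; map_hint_then_populate() and the
"guest-ram" name are made up for the example; MADV_POPULATE_WRITE needs
Linux 5.14 or later, so the sketch touches each page by hand on older
kernels.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* MADV_POPULATE_WRITE exists since Linux 5.14; older headers may lack it. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: map `len` bytes of memfd-backed shared memory,
 * apply the allocation hint first, and only then populate the pages.
 * Returns the mapping, or MAP_FAILED on error.
 */
void *map_hint_then_populate(size_t len)
{
	assert(len > 0);

	int fd = memfd_create("guest-ram", MFD_CLOEXEC);
	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return MAP_FAILED;
	}

	/* No MAP_POPULATE here, so no pages are allocated yet. */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the memfd alive */
	if (addr == MAP_FAILED)
		return MAP_FAILED;

#ifdef MADV_NOMOVABLE
	/* Proposed in this RFC only; skipped when the headers lack it. */
	if (madvise(addr, len, MADV_NOMOVABLE) < 0) {
		munmap(addr, len);
		return MAP_FAILED;
	}
#endif

	/* Allocate the pages now that the hint (if any) is in place. */
	if (madvise(addr, len, MADV_POPULATE_WRITE) < 0) {
		if (errno != EINVAL) {
			munmap(addr, len);
			return MAP_FAILED;
		}
		/* Probably a pre-5.14 kernel: fault the pages in by hand. */
		size_t page = (size_t)sysconf(_SC_PAGESIZE);
		for (size_t off = 0; off < len; off += page)
			((volatile char *)addr)[off] = 0;
	}
	return addr;
}
```

A caller would then hand the returned range to VFIO_IOMMU_MAP_DMA as
before; the only change compared with the MAP_POPULATE flow is that
allocation happens after the hint instead of inside mmap().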