Subject: Re: Re: [PATCH] vduse: avoid using __GFP_NOFAIL
From: Jason Wang <jasowang@redhat.com>
Date: Mon, 12 Aug 2024 14:59:42 +0800
References: <20240805082106.65847-1-jasowang@redhat.com>
To: Yongji Xie
Cc: Maxime Coquelin, Xuan Zhuo, "Michael S. Tsirkin",
Tsirkin" , Eugenio Perez Martin , virtualization@lists.linux.dev, linux-kernel , 21cnbao@gmail.com, penguin-kernel@i-love.sakura.ne.jp, linux-mm@kvack.org, Andrew Morton X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: A787D1C0009 X-Stat-Signature: ebpchqe5ky5pfrjy1e683dutqswpt8dp X-HE-Tag: 1723445997-261366 X-HE-Meta: U2FsdGVkX1986W0KuIsK3Gwghnti2BDpv7kFU9qnzlqmHfTMVtgqiEQa8hJ0HRAV2EM+HUxG6rgKnUfatTw2janHmbK+qiO6ik3J97fRDFhchImYsGvnVG4dFraFiGXvai1JZwjituiMmX7izkwHjf5Uyr5YmCVd7uvYvgr4zEmFTz4lQkxCfbfhlX6R/0RWripwqUibz/FM88juqFJ4bjR/ftFy0m2I6o3v39qV8uzUFsBoaPmAKLYX3IlYjP/EK+Jl3DSDURkRdIloVamYfkr4UoS+PTQmr6dfLdDznOM2574SNaohFrU9fXfdcr7zhW6rXXArRFB52B6BTLl7HE6vW29AFBdX1Opo6nfFWJB5t9NGR0AyJezmkoeRJVqzVra8O23LMHcDjHu7MBG7mchToxynuO4f381SZIkijJRSuiyfBz1WflxFCQPMDPpOpIQ3o9ikc6Y+k7KZyWfFYeZ7GPB9m87wdxOfuQlPopF0hx9OcUYzNrOwdcFDL7eIRZzL0QEk39283uSIHYgcdgQlnkb67xCcEGqonpcw7UOjdI24kd47J3qBf+SEVk9ZBL3sZYuu+vncdQwjnTm2MARvFal3TWmDFqJvdQ4u8HOjrSUiMPMbxmK1UxiVHsS/wWlHrPuoff2qcTfN24IzhxJt34SUOFoObtnuNhbOYfVPGFWB7K25UdJDo2lxx8n+9iVnfa4G9n6ku0jgxHNmo4rRVUa69egnMmBUzIGmSGAKzmV9b6m7OjW+DgN/wsUNAmTZVJxUWZ62YemkTiMi8EBr+xkgmUWTQvXNGDDpIUCtw6Iow9epdPB1UpQAxmxPMBG/QLZFo+6QNJb3NN8w6SXEXG9aqSzJstgWW1qe5v04Q0ZOYrxBOB5aD63DSOsdgEnTnCBvFcHgivMYk6NXcyzJdrShG/UuYvfXBNmCC4qfyNmDtzs1XD6yvvMwHrO3XsWPe0Dg6jBrscpNFSW vf5f9x9A DcOicOtNQjHNoar0sayT2g9rjO4/6IlGLtyaWL7GhuxW0+SBdDhvn7Yd+GRgLG0i6b8dZg7OSjTySChu1PsP2uk/UYeCcYpar6VnE9Yy2tD+8onchPt/Sd+xmgAMWwagLBOYsMLlM/0VX64j6llt+k+ZpND+01DTfX7pu3iR50aqTNG+StHmdfj/rTFTp5QrY7yxn2kDQRJS2WlUaTArGOqeB0wMqte7GnRTu0VRTuhZ3M8S5COsywQcV2UWxZdiUvqpQKja5AxtXozhH4lWLFFskMq6go8gI6ad3fS9DrQiR99y+4oT5YZ0p9fywTn3Bdvm8OxvJctqIIx6oloqF76h6wfgG9ieYUUfy X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Aug 8, 2024 at 7:09=E2=80=AFPM Yongji Xie = wrote: > > On Thu, Aug 8, 2024 at 1:50=E2=80=AFPM Jason Wang w= rote: > > > > On Wed, Aug 7, 2024 at 2:54=E2=80=AFPM Yongji Xie wrote: > > > > > > On Wed, Aug 7, 2024 at 12:38=E2=80=AFPM Jason Wang wrote: > > > > > > > > On Wed, Aug 7, 2024 at 11:13=E2=80=AFAM Yongji Xie wrote: > > > > > > > > > > On Wed, Aug 7, 2024 at 10:39=E2=80=AFAM Jason Wang wrote: > > > > > > > > > > > > On Tue, Aug 6, 2024 at 11:10=E2=80=AFAM Yongji Xie wrote: > > > > > > > > > > > > > > On Tue, Aug 6, 2024 at 10:28=E2=80=AFAM Jason Wang wrote: > > > > > > > > > > > > > > > > On Mon, Aug 5, 2024 at 6:42=E2=80=AFPM Yongji Xie wrote: > > > > > > > > > > > > > > > > > > On Mon, Aug 5, 2024 at 4:24=E2=80=AFPM Jason Wang wrote: > > > > > > > > > > > > > > > > > > > > On Mon, Aug 5, 2024 at 4:21=E2=80=AFPM Jason Wang wrote: > > > > > > > > > > > > > > > > > > > > > > Barry said [1]: > > > > > > > > > > > > > > > > > > > > > > """ > > > > > > > > > > > mm doesn't support non-blockable __GFP_NOFAIL allocat= ion. Because > > > > > > > > > > > __GFP_NOFAIL without direct reclamation may just resu= lt in a busy > > > > > > > > > > > loop within non-sleepable contexts. > > > > > > > > > > > ""=E2=80=9C > > > > > > > > > > > > > > > > > > > > > > Unfortuantely, we do that under read lock. 
> > > > > > > > > > > A possible way to fix that is to move the pages allocation out of
> > > > > > > > > > > the lock into the caller, but having to allocate a huge number of
> > > > > > > > > > > pages and an auxiliary page array seems to be problematic as well
> > > > > > > > > > > per Tetsuo [2]:
> > > > > > > > > > >
> > > > > > > > > > > """
> > > > > > > > > > > You should implement proper error handling instead of using
> > > > > > > > > > > __GFP_NOFAIL if count can become large.
> > > > > > > > > > > """
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think the problem is that it's hard to do the error handling in
> > > > > > > > > fops->release() currently.
> > > > > > > >
> > > > > > > > vduse_dev_dereg_umem() should be the same; it's very hard to allow it
> > > > > > > > to fail.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So can we temporarily hold the user page refcount and release it when
> > > > > > > > > vduse_dev_open()/vduse_domain_release() is executed? The kernel page
> > > > > > > > > allocation and memcpy can then be done in vduse_dev_open(), which
> > > > > > > > > allows some error handling.
> > > > > > > >
> > > > > > > > Just to make sure I understand this: the free is probably not the big
> > > > > > > > issue, but the allocation itself is.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, so deferring the allocation might be a solution.
> > > > > >
> > > > > > Would you mind posting a patch for this?
> > > > > >
> > > > > > >
> > > > > > > > And if we do the memcpy() in open(), it seems to be a subtle
> > > > > > > > userspace-noticeable change? (Or I don't get how copying in
> > > > > > > > vduse_dev_open() can help here.)
> > > > > > > >
> > > > > > >
> > > > > > > Maybe we don't need to do the copy in open(). We can hold the user
> > > > > > > page refcount until the inflight I/O is completed. That means the
> > > > > > > allocation of new kernel pages can be done in
> > > > > > > vduse_domain_map_bounce_page() and the release of old user pages can
> > > > > > > be done in vduse_domain_unmap_bounce_page().
> > > > > >
> > > > > > This seems to be a subtle userspace-noticeable behaviour change?
> > > > > >
> > > > >
> > > > > Yes, userspace needs to ensure that it does not reuse the old user
> > > > > pages for other purposes before vduse_dev_dereg_umem() returns
> > > > > successfully. vduse_dev_dereg_umem() will only return successfully
> > > > > when there is no inflight I/O, which means we don't need to allocate
> > > > > extra kernel pages to store data. If we can't accept this, then your
> > > > > current patch might be the most suitable.
> > > >
> > > > It might be better not to break that.
> > > >
> > > > Actually, during my testing the read_lock in the do_bounce path slows
> > > > down performance. Removing the read_lock or using rcu_read_lock()
> > > > gives a 20% improvement in PPS.
> > > >
> > >
> > > Looks like rcu_read_lock() should be OK here.
> >
> > The tricky part is that we may still end up with behaviour changes (or
> > lose some of the synchronization between kernel and bounce pages):
> >
> > RCU allows the read side to run in parallel with the writer, so
> > bouncing could happen in parallel with
> > vduse_domain_add_user_bounce_pages(), and there would be a race between
> > the two memcpy() calls.
> >
>
> Hmm... this is a problem. We may still need some userspace-noticeable
> behaviour change, e.g. only allowing reg_umem/dereg_umem when the
> device is not started.

Exactly, maybe have a new userspace flag.

Thanks

>
> Thanks,
> Yongji
>
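
For illustration only, here is a minimal sketch of the "allocate outside
the non-sleepable lock, then publish under it" shape discussed above as
the alternative to __GFP_NOFAIL. Every identifier below is hypothetical
and this is not the VDUSE bounce-path code; the real driver has its own
locking and data structures.

/*
 * Hypothetical sketch: allocate before taking the non-sleepable lock,
 * publish the page under the lock, and free it if we lost the race.
 * All names here are made up for illustration.
 */
#include <linux/gfp.h>
#include <linux/spinlock.h>

struct bounce_slot_sketch {
	spinlock_t lock;
	struct page *page;
};

static int bounce_slot_install_sketch(struct bounce_slot_sketch *slot)
{
	struct page *page;

	/* May sleep and may fail cleanly: done before taking the lock. */
	page = alloc_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	spin_lock(&slot->lock);
	if (!slot->page) {
		slot->page = page;
		page = NULL;
	}
	spin_unlock(&slot->lock);

	/* Someone else installed a page first; drop ours. */
	if (page)
		__free_page(page);

	return 0;
}

The point of the shape is that the allocation can sleep and can fail
with a proper error, while the lock is only held for the pointer
publication.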
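
And a rough sketch of the idea raised at the end of the thread: refuse
umem (de)registration once the device has been started, so the bounce
path never has to swap pages while I/O is in flight. The 'started' flag
and every name here are hypothetical, not the real driver interface.

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct umem_dev_sketch {
	struct mutex lock;
	bool started;	/* hypothetical "device has been started" flag */
};

static int umem_reg_sketch(struct umem_dev_sketch *dev)
{
	int ret = -EBUSY;

	mutex_lock(&dev->lock);
	if (!dev->started) {
		/* ... pin user pages and install the bounce mapping ... */
		ret = 0;
	}
	mutex_unlock(&dev->lock);

	return ret;
}

Deregistration would mirror the same check, making the behaviour change
explicit to userspace instead of racing with inflight bouncing.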