Message-ID: <1713510661.7868748-2-xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH vhost 3/6] virtio_net: replace private by pp struct inside page
Date: Fri, 19 Apr 2024 15:11:01 +0800
From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
To: Jesper Dangaard Brouer
Cc: virtualization@lists.linux.dev, "Michael S. Tsirkin", "David S. Miller",
 Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev@vger.kernel.org,
 Linux-MM, Matthew Wilcox, Ilias Apalodimas, Mel Gorman, Jason Wang
References: <20240411025127.51945-1-xuanzhuo@linux.alibaba.com>
 <20240411025127.51945-4-xuanzhuo@linux.alibaba.com>
 <1712900153.3715405-1-xuanzhuo@linux.alibaba.com>
 <1713146919.8867755-1-xuanzhuo@linux.alibaba.com>
 <1713170201.06163-2-xuanzhuo@linux.alibaba.com>
 <1713171554.2423792-1-xuanzhuo@linux.alibaba.com>
 <1713317444.7698638-1-xuanzhuo@linux.alibaba.com>
 <1713342055.436048-1-xuanzhuo@linux.alibaba.com>

On Thu, 18 Apr 2024 22:19:33 +0200, Jesper Dangaard Brouer wrote:
>
>
> On 17/04/2024 10.20, Xuan Zhuo wrote:
> > On Wed, 17 Apr 2024 12:08:10 +0800, Jason Wang wrote:
> >> On Wed, Apr 17, 2024 at 9:38 AM Xuan Zhuo wrote:
> >>>
> >>> On Tue, 16 Apr 2024 11:24:53 +0800, Jason Wang wrote:
> >>>> On Mon, Apr 15, 2024 at 5:04 PM Xuan Zhuo wrote:
> >>>>>
> >>>>> On Mon, 15 Apr 2024 16:56:45 +0800, Jason Wang wrote:
> >>>>>> On Mon, Apr 15, 2024 at 4:50 PM Xuan Zhuo wrote:
> >>>>>>>
> >>>>>>> On Mon, 15 Apr 2024 14:43:24 +0800, Jason Wang wrote:
> >>>>>>>> On Mon, Apr 15, 2024 at 10:35 AM Xuan Zhuo wrote:
> >>>>>>>>>
> >>>>>>>>> On Fri, 12 Apr 2024 13:49:12 +0800, Jason Wang wrote:
> >>>>>>>>>> On Fri, Apr 12, 2024 at 1:39 PM Xuan Zhuo wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, 12 Apr 2024 12:47:55 +0800, Jason Wang wrote:
> >>>>>>>>>>>> On Thu, Apr 11, 2024 at 10:51 AM Xuan Zhuo wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Now, we chain the pages of big mode by the page's private variable.
> >>>>>>>>>>>>> But a subsequent patch aims to make big mode support premapped
> >>>>>>>>>>>>> mode. This requires additional space to store the DMA addr.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Within the sub-struct that contains 'private', there is no suitable
> >>>>>>>>>>>>> variable for storing the DMA addr:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         struct {        /* Page cache and anonymous pages */
> >>>>>>>>>>>>>                 /**
> >>>>>>>>>>>>>                  * @lru: Pageout list, eg. active_list protected by
> >>>>>>>>>>>>>                  * lruvec->lru_lock. Sometimes used as a generic list
> >>>>>>>>>>>>>                  * by the page owner.
> >>>>>>>>>>>>>                  */
> >>>>>>>>>>>>>                 union {
> >>>>>>>>>>>>>                         struct list_head lru;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>                         /* Or, for the Unevictable "LRU list" slot */
> >>>>>>>>>>>>>                         struct {
> >>>>>>>>>>>>>                                 /* Always even, to negate PageTail */
> >>>>>>>>>>>>>                                 void *__filler;
> >>>>>>>>>>>>>                                 /* Count page's or folio's mlocks */
> >>>>>>>>>>>>>                                 unsigned int mlock_count;
> >>>>>>>>>>>>>                         };
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>                         /* Or, free page */
> >>>>>>>>>>>>>                         struct list_head buddy_list;
> >>>>>>>>>>>>>                         struct list_head pcp_list;
> >>>>>>>>>>>>>                 };
> >>>>>>>>>>>>>                 /* See page-flags.h for PAGE_MAPPING_FLAGS */
> >>>>>>>>>>>>>                 struct address_space *mapping;
> >>>>>>>>>>>>>                 union {
> >>>>>>>>>>>>>                         pgoff_t index;          /* Our offset within mapping. */
> >>>>>>>>>>>>>                         unsigned long share;    /* share count for fsdax */
> >>>>>>>>>>>>>                 };
> >>>>>>>>>>>>>                 /**
> >>>>>>>>>>>>>                  * @private: Mapping-private opaque data.
> >>>>>>>>>>>>>                  * Usually used for buffer_heads if PagePrivate.
> >>>>>>>>>>>>>                  * Used for swp_entry_t if PageSwapCache.
> >>>>>>>>>>>>>                  * Indicates order in the buddy system if PageBuddy.
> >>>>>>>>>>>>>                  */
> >>>>>>>>>>>>>                 unsigned long private;
> >>>>>>>>>>>>>         };
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> But within the page_pool sub-struct, we have a variable called
> >>>>>>>>>>>>> dma_addr that is appropriate for storing the DMA addr.
> >>>>>>>>>>>>> And that struct is used by the netstack. That works to our advantage:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         struct {        /* page_pool used by netstack */
> >>>>>>>>>>>>>                 /**
> >>>>>>>>>>>>>                  * @pp_magic: magic value to avoid recycling non
> >>>>>>>>>>>>>                  * page_pool allocated pages.
> >>>>>>>>>>>>>                  */
> >>>>>>>>>>>>>                 unsigned long pp_magic;
> >>>>>>>>>>>>>                 struct page_pool *pp;
> >>>>>>>>>>>>>                 unsigned long _pp_mapping_pad;
> >>>>>>>>>>>>>                 unsigned long dma_addr;
> >>>>>>>>>>>>>                 atomic_long_t pp_ref_count;
> >>>>>>>>>>>>>         };
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On the other hand, we should use variables from the same sub-struct.
> >>>>>>>>>>>>> So this patch replaces "private" with "pp".
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Xuan Zhuo
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead of doing a customized version of page pool, can we simply
> >>>>>>>>>>>> switch to using page pool for big mode instead? Then we don't need
> >>>>>>>>>>>> to bother with the DMA stuff.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> The page pool needs to do the DMA via the DMA APIs.
> >>>>>>>>>>> So we can not use the page pool directly.
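
To be concrete about what the patch does: the chain pointer moves into
page->pp, leaving page->dma_addr free for the premapped address. Below
is a rough sketch; the two macros are from this patch, while
give_pages() only mirrors the shape of the existing big-mode helper in
virtio_net.c (struct receive_queue is that driver's internal type) and
is not the exact patched code.

#define page_chain_next(p)	((struct page *)((p)->pp))
#define page_chain_add(p, n)	((p)->pp = (void *)(n))

/* Return a page (and any pages chained behind it) to rq->pages. */
static void give_pages(struct receive_queue *rq, struct page *page)
{
	struct page *end;

	/* Walk to the tail of the chain rooted at @page ... */
	for (end = page; page_chain_next(end); end = page_chain_next(end))
		;

	/* ... and splice the queue's existing page list after it. */
	page_chain_add(end, rq->pages);
	rq->pages = page;
}
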
> >>>>>>>>>>
> >>>>>>>>>> I found this:
> >>>>>>>>>>
> >>>>>>>>>> #define PP_FLAG_DMA_MAP	BIT(0) /* Should page_pool do the DMA
> >>>>>>>>>>					* map/unmap
> >>>>>>>>>>
> >>>>>>>>>> It seems to work here?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I have studied the page pool mechanism and believe that we cannot use
> >>>>>>>>> it directly. We can make the page pool bypass the DMA operations.
> >>>>>>>>> This allows us to handle DMA within virtio-net for pages allocated
> >>>>>>>>> from the page pool. Furthermore, we can utilize page pool helpers to
> >>>>>>>>> associate the DMA address with the page.
> >>>>>>>>>
> >>>>>>>>> However, the critical issue pertains to unmapping. Ideally, we want
> >>>>>>>>> to return the mapped pages to the page pool and reuse them. In doing
> >>>>>>>>> so, we can omit the unmapping and remapping steps.
> >>>>>>>>>
> >>>>>>>>> Currently, there's a caveat: when the page pool cache is full, it
> >>>>>>>>> disconnects and releases the pages. When the pool hits its capacity,
> >>>>>>>>> pages are relinquished without a chance for unmapping.
>
> Could Jakub's memory provider for PP help your use-case?
>
> See: [1]
> https://lore.kernel.org/all/20240403002053.2376017-3-almasrymina@google.com/
> Or: [2]
> https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/

It can not. That only lets the pp obtain pages via callbacks. What we
are talking about here is map/unmap: virtio-net has its own DMA APIs,
so the page pool cannot do the mapping for us (a usage sketch follows
below, after the quoted context):

dma_addr_t virtqueue_dma_map_single_attrs(struct virtqueue *_vq, void *ptr,
					  size_t size, enum dma_data_direction dir,
					  unsigned long attrs);
void virtqueue_dma_unmap_single_attrs(struct virtqueue *_vq, dma_addr_t addr,
				      size_t size, enum dma_data_direction dir,
				      unsigned long attrs);
dma_addr_t virtqueue_dma_map_page_attrs(struct virtqueue *_vq, struct page *page,
					size_t offset, size_t size,
					enum dma_data_direction dir,
					unsigned long attrs);
void virtqueue_dma_unmap_page_attrs(struct virtqueue *_vq, dma_addr_t addr,
				    size_t size, enum dma_data_direction dir,
				    unsigned long attrs);
int virtqueue_dma_mapping_error(struct virtqueue *_vq, dma_addr_t addr);

bool virtqueue_dma_need_sync(struct virtqueue *_vq, dma_addr_t addr);
void virtqueue_dma_sync_single_range_for_cpu(struct virtqueue *_vq,
					     dma_addr_t addr,
					     unsigned long offset, size_t size,
					     enum dma_data_direction dir);
void virtqueue_dma_sync_single_range_for_device(struct virtqueue *_vq,
						dma_addr_t addr,
						unsigned long offset, size_t size,
						enum dma_data_direction dir);

Thanks.

>
>
> [...]
> >>>>>>
> >>>>>> Adding Jesper for some comments.
> >>>>>>
> >>>>>>>
> >>>>>>> Back to this patch set, I think we should keep virtio-net managing
> >>>>>>> the pages.
> >>>>>>>
>
> For context the patch:
> [3]
> https://lore.kernel.org/all/20240411025127.51945-4-xuanzhuo@linux.alibaba.com/
>
> >>>>>>> What do you think?
> >>>>>>
> >>>>>> I might be wrong, but I think we need to either
> >>>>>>
> >>>>>> 1) seek a way to manage the pages by yourself without touching page
> >>>>>> pool metadata (or Jesper is fine with this)
> >>>>>
> >>>>> Do you mean working with page pool or not?
> >>>>>
> >>>>
> >>>> I meant if Jesper is fine with reusing page pool metadata like this patch.
> >>>>
> >>>>> If we manage the pages by ourselves (no page pool), we do not care
> >>>>> whether the metadata is for page pool or not. We just use the space of
> >>>>> the pages, like the "private".
> >>>>
> >>>> That's also fine.
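
Going back to the virtqueue DMA helpers listed above, here is the
promised usage sketch. It is illustrative only (the function names and
simplified error handling are mine, not from the patchset); it shows a
big-mode page being premapped once, with the matching unmap deferred
until the page finally leaves the driver:

/* Premap one big-mode page for receive and hand back its DMA address. */
static int virtnet_map_big_page(struct virtqueue *vq, struct page *page,
				dma_addr_t *addr)
{
	*addr = virtqueue_dma_map_page_attrs(vq, page, 0, PAGE_SIZE,
					     DMA_FROM_DEVICE, 0);
	if (virtqueue_dma_mapping_error(vq, *addr))
		return -ENOMEM;

	return 0;
}

/* The matching unmap, only when the page leaves the driver for good. */
static void virtnet_unmap_big_page(struct virtqueue *vq, dma_addr_t addr)
{
	virtqueue_dma_unmap_page_attrs(vq, addr, PAGE_SIZE,
				       DMA_FROM_DEVICE, 0);
}
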
> >>>>
>
> I'm not sure it is "fine" to explicitly choose not to use page pool and
> then (ab)use the `struct page` member (pp) that is intended for
> page_pool for other stuff (in this case, creating a linked list of
> pages):
>
> +#define page_chain_next(p)	((struct page *)((p)->pp))
> +#define page_chain_add(p, n)	((p)->pp = (void *)n)
>
> I'm not sure that I (as PP maintainer) can make this call actually, as
> I think this area belongs with the MM "page" maintainers (Cc MM-list +
> people) to judge.
>
> Just inventing new ways to use struct page fields, without adding your
> use-case to struct page, will make it harder for MM people to maintain
> (e.g. make future changes).
>
> --Jesper
>
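
(For completeness, the alternative Jason raised earlier, letting a real
page pool own the pages, would look roughly like the sketch below. This
is a hypothetical illustration, not code from this thread, and as
explained above it only works if the pool may do the DMA mapping
itself, which virtio cannot allow today because it has its own DMA
APIs.)

#include <net/page_pool/types.h>

static struct page_pool *virtnet_create_big_pool(struct device *dma_dev)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP,	/* pool does the map/unmap */
		.order		= 0,
		.pool_size	= 256,			/* illustrative size */
		.nid		= NUMA_NO_NODE,
		.dev		= dma_dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp_params);
}
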