From: Nitin Gupta <ngupta@vflare.org>
Date: Tue, 19 Feb 2013 17:58:55 -0800
Subject: Re: [PATCHv5 1/8] zsmalloc: add to mm/
To: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Joonsoo Kim, Andrew Morton, Greg Kroah-Hartman, Minchan Kim,
 Konrad Rzeszutek Wilk, Dan Magenheimer, Robert Jennings, Jenifer Hopper,
 Mel Gorman, Johannes Weiner, Rik van Riel, Larry Woodman,
 Benjamin Herrenschmidt, Dave Hansen, Joe Perches, linux-mm,
 linux-kernel, Linux Driver Project



On Tue, Feb 19, 2013 at 9:54 AM, Seth Jennings <sjenning@linux.vnet.ibm.com> wrote:
> On 02/19/2013 03:18 AM, Joonsoo Kim wrote:
> > Hello, Seth.
> > I'm not sure that this is the right time to review, because I have already
> > seen many efforts by various people to promote the zxxx series. I don't
> > want to be a stopper to promote these. :)

> Any time is good review time :)  Thanks for your review!
>
> >
> > But I read the code now, and have some comments below.
> >
> > On Wed, Feb 13, 2013 at 12:38:44PM -0600, Seth Jennings wrote:
> >> =========
> >> DO NOT MERGE, FOR REVIEW ONLY
> >> This patch introduces zsmalloc as new code; however, it already
> >> exists in drivers/staging.  In order to build successfully, you
> >> must select EITHER the drivers/staging version OR this version.
> >> Once zsmalloc is reviewed in this format (and hopefully accepted),
> >> I will create a new patchset that properly promotes zsmalloc from
> >> staging.
> >> =========
> >>
> >> This patchset introduces a new slab-based memory allocator,
> >> zsmalloc, for storing compressed pages.  It is designed for
> >> low fragmentation and a high allocation success rate on
> >> large, but <= PAGE_SIZE, allocations.
> >>
> >> zsmalloc differs from the kernel slab allocator in two primary
> >> ways to achieve these design goals.
> >>
> >> zsmalloc never requires high-order page allocations to back
> >> slabs, or "size classes" in zsmalloc terms.  Instead it allows
> >> multiple single-order pages to be stitched together into a
> >> "zspage" which backs the slab.  This allows for a higher allocation
> >> success rate under memory pressure.
> >>
> >> Also, zsmalloc allows objects to span page boundaries within the
> >> zspage.  This allows for lower fragmentation than could be had
> >> with the kernel slab allocator for objects between PAGE_SIZE/2
> >> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
> >> to 60% of its original size, the memory savings gained through
> >> compression is lost to fragmentation because another object of
> >> the same size can't be stored in the leftover space.
> >>
> >> This ability to span pages results in zsmalloc allocations not being
> >> directly addressable by the user.  The user is given a
> >> non-dereferenceable handle in response to an allocation request.
> >> That handle must be mapped, using zs_map_object(), which returns
> >> a pointer to the mapped region that can be used.  The mapping is
> >> necessary since the object data may reside in two different
> >> noncontiguous pages.
> >>
> >> zsmalloc fulfills the allocation needs for zram and zswap.
> >>
> >> Acked-by: Nitin Gupta <ngupta@vflare.org>
> >> Acked-by: Minchan Kim <minchan@kernel.org>
> >> Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
> >> ---
> >>  include/linux/zsmalloc.h |   49 ++
> >>  mm/Kconfig               |   24 +
> >>  mm/Makefile              |    1 +
> >>  mm/zsmalloc.c            | 1124 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 1198 insertions(+)
> >>  create mode 100644 include/linux/zsmalloc.h
> >>  create mode 100644 mm/zsmalloc.c
> >>
> >> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> >> new file mode 100644
> >> index 0000000..eb6efb6
> >> --- /dev/null
> >> +++ b/include/linux/zsmalloc.h
> >> @@ -0,0 +1,49 @@
> >> +/*
> >> + * zsmalloc memory allocator
> >> + *
> >> + * Copyright (C) 2011  Nitin Gupta
> >> + *
> >> + * This code is released using a dual license strategy: BSD/GPL
> >> + * You can choose the license that better fits your requirements.
> >> + *
> >> + * Released under the terms of 3-clause BSD License
> >> + * Released under the terms of GNU General Public License Version 2.0
> >> + */
> >> +
> >> +#ifndef _ZS_MALLOC_H_
> >> +#define _ZS_MALLOC_H_
> >> +
> >> +#include <linux/types.h>
> >> +#include <linux/mm_types.h>
> >> +
> >> +/*
> >> + * zsmalloc mapping modes
> >> + *
> >> + * NOTE: These only make a difference when a mapped object spans pages
> >> +*/
> >> +enum zs_mapmode {
> >> +	ZS_MM_RW, /* normal read-write mapping */
> >> +	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
> >> +	ZS_MM_WO /* write-only (no copy-in at map time) */
> >> +};
> >
> >
> > These make no difference for PGTABLE_MAPPING.
> > Please add some comment for this.
>
> Yes. Will do.
>
> >
> >> +struct zs_ops {
> >> +	struct page * (*alloc)(gfp_t);
> >> +	void (*free)(struct page *);
> >> +};
> >> +
> >> +struct zs_pool;
> >> +
> >> +struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
> >> +void zs_destroy_pool(struct zs_pool *pool);
> >> +
> >> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> >> +void zs_free(struct zs_pool *pool, unsigned long obj);
> >> +
> >> +void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >> +			enum zs_mapmode mm);
> >> +void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
> >> +
> >> +u64 zs_get_total_size_bytes(struct zs_pool *pool);
> >> +
> >> +#endif
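
(Aside, for anyone reading along who has not used this API: a minimal
usage sketch against just the declarations quoted above. The zs_ops
callbacks, GFP flags, and error returns here are illustrative choices
of mine, not something this patch mandates.)

	#include <linux/errno.h>
	#include <linux/gfp.h>
	#include <linux/string.h>
	#include <linux/zsmalloc.h>

	static struct page *zsp_alloc_page(gfp_t flags)
	{
		return alloc_page(flags);
	}

	static void zsp_free_page(struct page *page)
	{
		__free_page(page);
	}

	static struct zs_ops zsp_ops = {
		.alloc	= zsp_alloc_page,
		.free	= zsp_free_page,
	};

	/* store len (<= PAGE_SIZE) bytes of compressed data from src */
	static int zsp_demo(void *src, size_t len)
	{
		struct zs_pool *pool;
		unsigned long handle;
		void *dst;

		pool = zs_create_pool(GFP_NOIO, &zsp_ops);
		if (!pool)
			return -ENOMEM;

		handle = zs_malloc(pool, len, GFP_NOIO);
		if (!handle) {
			zs_destroy_pool(pool);
			return -ENOMEM;
		}

		/* the handle is opaque, not a pointer: map it before use */
		dst = zs_map_object(pool, handle, ZS_MM_WO);
		memcpy(dst, src, len);
		zs_unmap_object(pool, handle);

		zs_free(pool, handle);
		zs_destroy_pool(pool);
		return 0;
	}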
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 278e3ab..25b8f38 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -446,3 +446,27 @@ config FRONTSWAP
> >>  	  and swap data is stored as normal on the matching swap device.
> >>
> >>  	  If unsure, say Y to enable frontswap.
> >> +
> >> +config ZSMALLOC
> >> +	tristate "Memory allocator for compressed pages"
> >> +	default n
> >> +	help
> >> +	  zsmalloc is a slab-based memory allocator designed to store
> >> +	  compressed RAM pages.  zsmalloc uses virtual memory mapping
> >> +	  in order to reduce fragmentation.  However, this results in a
> >> +	  non-standard allocator interface where a handle, not a pointer, is
> >> +	  returned by an alloc().  This handle must be mapped in order to
> >> +	  access the allocated space.
> >> +
> >> +config PGTABLE_MAPPING
> >> +	bool "Use page table mapping to access object in zsmalloc"
> >> +	depends on ZSMALLOC
> >> +	help
> >> +	  By default, zsmalloc uses a copy-based object mapping method to
> >> +	  access allocations that span two pages.  However, if a particular
> >> +	  architecture (ex, ARM) performs VM mapping faster than copying,
> >> +	  then you should select this.  This causes zsmalloc to use page table
> >> +	  mapping rather than copying for object mapping.
> >> +
> >> +	  You can check speed with the zsmalloc benchmark[1].
> >> +	  [1] https://github.com/spartacus06/zsmalloc
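
The speed question above is exactly the spanning-object case: per the
help text, the default copy-based method has to copy a spanning object
into a contiguous per-cpu buffer at map time (unless ZS_MM_WO) and back
at unmap time (unless ZS_MM_RO). Roughly like this (a sketch only;
next_component_page() stands in for the allocator's internal page
chasing and is not a real symbol):

	static void copy_object_in(char *buf, struct page *page,
				   int off, int size)
	{
		size_t first = PAGE_SIZE - off;	/* bytes in the first page */
		void *addr;

		addr = kmap_atomic(page);
		memcpy(buf, addr + off, first);
		kunmap_atomic(addr);

		addr = kmap_atomic(next_component_page(page));
		memcpy(buf + first, addr, size - first);
		kunmap_atomic(addr);
	}

With PGTABLE_MAPPING, the two component pages are instead mapped
virtually contiguously (e.g. via map_vm_area() into a reserved per-cpu
VM area), so the copies go away at the cost of page table manipulation.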
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index 3a46287..0f6ef0a 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> >>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> >>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
> >>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> >> +obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
> >> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >> new file mode 100644
> >> index 0000000..34378ef
> >> --- /dev/null
> >> +++ b/mm/zsmalloc.c
> >> @@ -0,0 +1,1124 @@
> >> +/*
> >> + * zsmalloc memory allocator
> >> + *
> >> + * Copyright (C) 2011  Nitin Gupta
> >> + *
> >> + * This code is released using a dual license strategy: BSD/GPL
> >> + * You can choose the license that better fits your requirements.
> >> + *
> >> + * Released under the terms of 3-clause BSD License
> >> + * Released under the terms of GNU General Public License Version 2.0
> >> + */
> >> +
> >> +
> >> +/*
> >> + * This allocator is designed for use with zcache and zram. Thus, the
> >> + * allocator is supposed to work well under low memory conditions. In
> >> + * particular, it never attempts higher order page allocation, which is
> >> + * very likely to fail under memory pressure. On the other hand, if we
> >> + * just use single (0-order) pages, it would suffer from very high
> >> + * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy
> >> + * an entire page. This was one of the major issues with its predecessor
> >> + * (xvmalloc).
> >> + *
> >> + * To overcome these issues, zsmalloc allocates a bunch of 0-order pages
> >> + * and links them together using various 'struct page' fields. These linked
> >> + * pages act as a single higher-order page, i.e. an object can span 0-order
> >> + * page boundaries. The code refers to these linked pages as a single entity
> >> + * called a zspage.
> >> + *
> >> + * For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
> >> + * since this satisfies the requirements of all its current users (in the
> >> + * worst case, a page is incompressible and is thus stored "as-is", i.e. in
> >> + * uncompressed form). For allocation requests larger than this size, failure
> >> + * is returned (see zs_malloc).
> >> + *
> >> + * Additionally, zs_malloc() does not return a dereferenceable pointer.
> >> + * Instead, it returns an opaque handle (unsigned long) which encodes the
> >> + * actual location of the allocated object. The reason for this indirection
> >> + * is that zsmalloc does not keep zspages permanently mapped, since that would
> >> + * cause issues on 32-bit systems where the VA region for kernel space
> >> + * mappings is very small. So, before using the allocated memory, the object
> >> + * has to be mapped using zs_map_object() to get a usable pointer and
> >> + * subsequently unmapped using zs_unmap_object().
> >> + *
> >> + * Following is how we use various fields and flags of underlying
> >> + * struct page(s) to form a zspage.
> >> + *
> >> + * Usage of struct page fields:
> >> + *	page->first_page: points to the first component (0-order) page
> >> + *	page->index (union with page->freelist): offset of the first object
> >> + *		starting in this page. For the first page, this is
> >> + *		always 0, so we use this field (aka freelist) to point
> >> + *		to the first free object in zspage.
> >> + *	page->lru: links together all component pages (except the first page)
> >> + *		of a zspage
> >> + *
> >> + *	For _first_ page only:
> >> + *
> >> + *	page->private (union with page->first_page): refers to the
> >> + *		component page after the first page
> >> + *	page->freelist: points to the first free object in zspage.
> >> + *		Free objects are linked together using in-place
> >> + *		metadata.
> >> + *	page->objects: maximum number of objects we can store in this
> >> + *		zspage (class->zspage_order * PAGE_SIZE / class->size)
> >
> > How about just embedding the maximum number of objects in size_class?
> > For SLUB, each slab can have a different number of objects.
> > But for zsmalloc that is not possible, so there is no reason
> > to maintain it within the metadata of the zspage. Just embedding it in
> > size_class is sufficient.
>
> Yes, a little code massaging and this can go away.
>
> However, there might be some value in having variable sized zspages in
> the same size_class.  It could improve allocation success rate at the
> expense of efficiency by not failing in alloc_zspage() if we can't
> allocate the optimal number of pages.  As long as we can allocate the
> first page, then we can proceed.

Yes, I remember trying to allow partial failures and thus allow variable
sized zspages within the same size class, but I just skipped that since
it seemed non-trivial to do and the value wasn't clear: if the
system cannot give us 4 (non-contiguous) pages, then it's probably
going to fail allocation requests for single pages too in the not so
distant future.  Also, for objects > PAGE_SIZE/2, falling back to a zspage
of just PAGE_SIZE is not going to help either.
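
To make the suggested embedding concrete (the field name is
hypothetical, just a sketch of Joonsoo's idea): the count is a
per-class constant, so it can live in struct size_class instead of
page->objects.

	struct size_class {
		int size;
		unsigned int index;
		int pages_per_zspage;

		/* replaces page->objects; constant for the class,
		 * computed once as pages_per_zspage * PAGE_SIZE / size */
		unsigned int objs_per_zspage;

		spinlock_t lock;

		/* stats */
		u64 pages_allocated;

		struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
	};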

> >
> >
> >> + *	page->lru: links together first pages of various zspages.
> >> + *		Basically forming list of zspages in a fullness group.
> >> + *	page->mapping: class index and fullness group of the zspage
> >> + *
> >> + * Usage of struct page flags:
> >> + *	PG_private: identifies the first component page
> >> + *	PG_private2: identifies the last component page
> >> + *
> <snip>
>
> >> + */
> >> +struct size_class {
> >> +	/*
> >> +	 * Size of objects stored in this class. Must be multiple
> >> +	 * of ZS_ALIGN.
> >> +	 */
> >> +	int size;
> >> +	unsigned int index;
> >> +
> >> +	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
> >> +	int pages_per_zspage;
> >> +
> >> +	spinlock_t lock;
> >> +
> >> +	/* stats */
> >> +	u64 pages_allocated;
> >> +
> >> +	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
> >> +};
> >
> > Instead of a simple pointer, how about using a list_head?
> > With this, fullness_list management is easily consolidated into
> > set_zspage_mapping(), and we can remove remove_zspage() and insert_zspage().



Yes, this looks like a nice cleanup.
Still, I would hold such changes till we get out of staging.
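
For reference, the shape of that cleanup would be roughly (a sketch
only, against the struct quoted above):

	struct size_class {
		/* ... other fields unchanged ... */
		struct list_head fullness_list[_ZS_NR_FULLNESS_GROUPS];
	};

	/* insert_zspage()/remove_zspage() then collapse into the
	 * generic list helpers, hanging each zspage off its first
	 * page's lru links: */
	list_add(&first_page->lru, &class->fullness_list[fullness]);
	list_del(&first_page->lru);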


Thanks for your comments, Kim.


Nitin
