From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anton Eidelman
Date: Fri, 1 Jun 2018 16:34:01 -0700
Subject: Re: HARDENED_USERCOPY will BUG on multiple slub objects coalesced
 into an sk_buff fragment
To: Kees Cook
Cc: Matthew Wilcox, Laura Abbott, Linux-MM, linux-hardened@lists.openwall.com
References: <55be03eb-3d0d-d43d-b0a4-669341e6d9ab@redhat.com>
 <20180601205837.GB29651@bombadil.infradead.org>
Sender: owner-linux-mm@kvack.org

Hi all,

I do not yet have a reproducer reliable enough to recommend; I'll keep digging.

The page already belongs to a slub cache when the fragment is constructed
in __skb_fill_page_desc(); see the instrumentation I used below.
When usercopy triggers, .coals shows 2 or 3 for the 128- and 192-byte
cases respectively.

The question is how the RX sk_buff ends up with a data fragment in a
PageSlab page.
Some network drivers use netdev_alloc_frag(), so their pages indeed come
from the page_frag allocator.
Others (Mellanox, Intel) just alloc_page() when filling their RX
descriptors.
In both cases the pages are refcounted properly.

I suspect my in-kernel TCP traffic, which uses kernel_sendpage() for both
bio pages and slub pages.

Thanks a lot!
Anton

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95..7cd744c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <linux/slub_def.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -316,7 +317,8 @@ struct skb_frag_struct {
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
-	__u32 size;
+	__u16 size;
+	__u16 coals;
 #else
 	__u16 page_offset;
 	__u16 size;
@@ -1850,9 +1852,11 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	 */
 	frag->page.p		= page;
 	frag->page_offset	= off;
+	frag->coals		= 0;
 	skb_frag_size_set(frag, size);
 
 	page = compound_head(page);
+	WARN_ON(PageSlab(page) && (page->slab_cache->size < size)); /* does NOT trigger */
 	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc = true;
 }
@@ -2849,10 +2853,14 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
 	if (i) {
-		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
+		struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
-		return page == skb_frag_page(frag) &&
+		bool ret = page == skb_frag_page(frag) &&
 			off == frag->page_offset + skb_frag_size(frag);
+		if (unlikely(ret))
+			if (PageSlab(compound_head((struct page *)page)))
+				frag->coals++;
+		return ret;
 	}
 	return false;
 }

On Fri, Jun 1, 2018 at 2:55 PM, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Jun 1, 2018 at 1:58 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Jun 01, 2018 at 01:49:38PM -0700, Kees Cook wrote:
> >> On Fri, Jun 1, 2018 at 12:02 PM, Laura Abbott <labbott@redhat.com> wrote:
> >> > (cc-ing some interested people)
> >> >
> >> > On 05/31/2018 05:03 PM, Anton Eidelman wrote:
> >> >> Here's a rare issue I reproduce on 4.12.10 (centos config): full log
> >> >> sample below.
> >>
> >> Thanks for digging into this! Do you have any specific reproducer for
> >> this?
> >> If so, I'd love to try a bisection, as I'm surprised this has
> >> only now surfaced: hardened usercopy was introduced in 4.8 ...
> >>
> >> >> An innocent process (dhcpclient) is about to receive a datagram, but
> >> >> during skb_copy_datagram_iter() usercopy triggers a BUG in:
> >> >> usercopy.c:check_heap_object() -> slub.c:__check_heap_object(), because
> >> >> the sk_buff fragment being copied crosses the 64-byte slub object boundary.
> >> >>
> >> >> Example __check_heap_object() context:
> >> >>    n=128    << usually 128, sometimes 192.
> >> >>    object_size=64
> >> >>    s->size=64
> >> >>    page_address(page)=0xffff880233f7c000
> >> >>    ptr=0xffff880233f7c540
> >> >>
> >> >> My take on the root cause:
> >> >>    When adding data to an skb, new data is appended to the current
> >> >> fragment if the new chunk immediately follows the last one: by simply
> >> >> increasing the frag->size, skb_frag_size_add().
> >> >>    See include/linux/skbuff.h:skb_can_coalesce() callers.
> >>
> >> Oooh, sneaky:
> >>         return page == skb_frag_page(frag) &&
> >>                off == frag->page_offset + skb_frag_size(frag);
> >>
> >> Originally I was thinking that slab red-zoning would get triggered
> >> too, but I see the above is checking to see if these are precisely
> >> neighboring allocations, I think.
> >>
> >> But then ... how does freeing actually work? I'm really not sure how
> >> this seeming layering violation could be safe in other areas?
> >
> > I'm confused ... I thought skb frags came from the page_frag allocator,
> > not the slab allocator.  But then why would the slab hardening trigger?
>
> Well that would certainly make more sense (well, the sense about
> alloc/free). Having it overlap with a slab allocation, though, that's
> quite bad. Perhaps this is a very odd use-after-free case? I.e. freed
> page got allocated to slab, and when it got copied out, usercopy found
> it spanned a slub object?
>
> [ 655.602500] usercopy: kernel memory exposure attempt detected from
> ffff88022a31aa00 (kmalloc-64) (192 bytes)
>
> This wouldn't be the first time usercopy triggered due to a memory
> corruption...
>
> -Kees
>
> --
> Kees Cook
> Pixel Security