Date: Fri, 27 Sep 2024 07:12:15 +0200
From: Marco Felsch
To: Sascha Hauer
Cc: "open list:BAREBOX", Ahmad Fatoum
Message-ID: <20240927051215.67lqwfarfcdnzbsq@pengutronix.de>
References: <20240926-arm-assembly-memmove-v2-0-0a3313d29a66@pengutronix.de>
 <20240926-arm-assembly-memmove-v2-10-0a3313d29a66@pengutronix.de>
In-Reply-To: <20240926-arm-assembly-memmove-v2-10-0a3313d29a66@pengutronix.de>
Subject: Re: [PATCH v2 10/10] ARM: add optimized memmove

Hi Sascha,

On 24-09-26, Sascha Hauer wrote:
> Until now there has been no assembler optimized version of memmove() for
> ARM. Add this from Linux-6.10 for both ARM32 and ARM64. This also updates
> memcpy() for ARM64 from Linux-6.10.

Out of curiosity, did you make performance measurements?

Regards,
Marco

> Reviewed-by: Ahmad Fatoum
> Signed-off-by: Sascha Hauer
> ---
> arch/arm/include/asm/cache.h | 8 ++
> arch/arm/include/asm/string.h | 4 +-
> arch/arm/lib32/Makefile | 1 +
> arch/arm/lib32/memmove.S | 206 +++++++++++++++++++++++++++++++
> arch/arm/lib64/copy_template.S | 180 ---------------------------
> arch/arm/lib64/memcpy.S | 274 ++++++++++++++++++++++++++++++++++-------
> arch/arm/lib64/memset.S | 18 +--
> arch/arm/lib64/string.c | 17 +++
> include/string.h | 2 +
> lib/string.c | 1 -
> 10 files changed, 478 insertions(+), 233 deletions(-)
>
> diff --git a/arch/arm/include/asm/cache.h b/arch/arm/include/asm/cache.h
> index 261c30129a..dd022c1f23 100644
> --- a/arch/arm/include/asm/cache.h
> +++ b/arch/arm/include/asm/cache.h
> @@ -3,6 +3,13 @@
> #ifndef __ASM_CACHE_H
> #define __ASM_CACHE_H
>
> +#ifdef CONFIG_CPU_64
> +#define L1_CACHE_SHIFT (6)
> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
> +#endif
> +
> +#ifndef __ASSEMBLY__
> +
> void v8_invalidate_icache_all(void);
> void v8_flush_dcache_all(void);
> void v8_invalidate_dcache_all(void);
> @@ -25,5 +32,6 @@ void arm_early_mmu_cache_invalidate(void);
> void sync_caches_for_execution(void);
>
> #include
> +#endif
>
> #endif
> diff --git a/arch/arm/include/asm/string.h b/arch/arm/include/asm/string.h
> index 2322b846b2..f79392e53d 100644
> --- a/arch/arm/include/asm/string.h
> +++ b/arch/arm/include/asm/string.h
> @@ -9,10 +9,12 @@
> extern void *memcpy(void *, const void *, __kernel_size_t);
> #define __HAVE_ARCH_MEMSET
> extern void *memset(void *, int, __kernel_size_t);
> -
> +#define __HAVE_ARCH_MEMMOVE
> +extern void *memmove(void *, const void *, __kernel_size_t);
> #endif
>
> extern void *__memcpy(void *, const void *, __kernel_size_t);
> extern void *__memset(void *, int, __kernel_size_t);
> +extern void *__memmove(void *, const void *, __kernel_size_t);
>
> #endif
> diff --git a/arch/arm/lib32/Makefile b/arch/arm/lib32/Makefile
> index 511a029062..a139a80fb8 100644
> --- a/arch/arm/lib32/Makefile
> +++ b/arch/arm/lib32/Makefile
> @@ -21,6 +21,7 @@ obj-y += lshrdi3.o
> obj-y += runtime-offset.o
> pbl-y += runtime-offset.o
> obj-$(CONFIG_ARM_OPTIMZED_STRING_FUNCTIONS) += memcpy.o
> +obj-$(CONFIG_ARM_OPTIMZED_STRING_FUNCTIONS) += memmove.o
> obj-$(CONFIG_ARM_OPTIMZED_STRING_FUNCTIONS) += memset.o
> obj-$(CONFIG_ARM_UNWIND) += unwind.o
> obj-$(CONFIG_MODULES) += module.o
> diff --git a/arch/arm/lib32/memmove.S b/arch/arm/lib32/memmove.S
> new file mode 100644
> index 0000000000..6410554039
> --- /dev/null
> +++ b/arch/arm/lib32/memmove.S
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * linux/arch/arm/lib/memmove.S
> + *
> + * Author: Nicolas Pitre
> + * Created: Sep 28, 2005
> + * Copyright: (C) MontaVista Software Inc.
> + */
> +
> +#include
> +#include
> +#include
> +
> + .text
> +
> +/*
> + * Prototype: void *memmove(void *dest, const void *src, size_t n);
> + *
> + * Note:
> + *
> + * If the memory regions don't overlap, we simply branch to memcpy which is
> + * normally a bit faster. Otherwise the copy is done going downwards. This
> + * is a transposition of the code from copy_template.S but with the copy
> + * occurring in the opposite direction.
> + */
> +
> +ENTRY(__memmove)
> +WEAK(memmove)
> + UNWIND( .fnstart )
> +
> + subs ip, r0, r1
> + cmphi r2, ip
> + bls __memcpy
> + UNWIND( .fnend )
> +
> + UNWIND( .fnstart )
> + UNWIND( .save {r0, r4, fpreg, lr} )
> + stmfd sp!, {r0, r4, UNWIND(fpreg,) lr}
> + UNWIND( .setfp fpreg, sp )
> + UNWIND( mov fpreg, sp )
> + add r1, r1, r2
> + add r0, r0, r2
> + subs r2, r2, #4
> + blt 8f
> + ands ip, r0, #3
> + PLD( pld [r1, #-4] )
> + bne 9f
> + ands ip, r1, #3
> + bne 10f
> +
> +1: subs r2, r2, #(28)
> + stmfd sp!, {r5, r6, r8, r9}
> + blt 5f
> +
> + CALGN( ands ip, r0, #31 )
> + CALGN( sbcsne r4, ip, r2 ) @ C is always set here
> + CALGN( bcs 2f )
> + CALGN( adr r4, 6f )
> + CALGN( subs r2, r2, ip ) @ C is set here
> + CALGN( rsb ip, ip, #32 )
> + CALGN( add pc, r4, ip )
> +
> + PLD( pld [r1, #-4] )
> +2: PLD( subs r2, r2, #96 )
> + PLD( pld [r1, #-32] )
> + PLD( blt 4f )
> + PLD( pld [r1, #-64] )
> + PLD( pld [r1, #-96] )
> +
> +3: PLD( pld [r1, #-128] )
> +4: ldmdb r1!, {r3, r4, r5, r6, r8, r9, ip, lr}
> + subs r2, r2, #32
> + stmdb r0!, {r3, r4, r5, r6, r8, r9, ip, lr}
> + bge 3b
> + PLD( cmn r2, #96 )
> + PLD( bge 4b )
> +
> +5: ands ip, r2, #28
> + rsb ip, ip, #32
> + addne pc, pc, ip @ C is always clear here
> + b 7f
> +6: W(nop)
> + W(ldr) r3, [r1, #-4]!
> + W(ldr) r4, [r1, #-4]!
> + W(ldr) r5, [r1, #-4]!
> + W(ldr) r6, [r1, #-4]!
> + W(ldr) r8, [r1, #-4]!
> + W(ldr) r9, [r1, #-4]!
> + W(ldr) lr, [r1, #-4]!
> +
> + add pc, pc, ip
> + nop
> + W(nop)
> + W(str) r3, [r0, #-4]!
> + W(str) r4, [r0, #-4]!
> + W(str) r5, [r0, #-4]!
> + W(str) r6, [r0, #-4]!
> + W(str) r8, [r0, #-4]!
> + W(str) r9, [r0, #-4]!
> + W(str) lr, [r0, #-4]!
> +
> + CALGN( bcs 2b )
> +
> +7: ldmfd sp!, {r5, r6, r8, r9}
> +
> +8: movs r2, r2, lsl #31
> + ldrbne r3, [r1, #-1]!
> + ldrbcs r4, [r1, #-1]!
> + ldrbcs ip, [r1, #-1]
> + strbne r3, [r0, #-1]!
> + strbcs r4, [r0, #-1]!
> + strbcs ip, [r0, #-1]
> + ldmfd sp!, {r0, r4, UNWIND(fpreg,) pc}
> +
> +9: cmp ip, #2
> + ldrbgt r3, [r1, #-1]!
> + ldrbge r4, [r1, #-1]!
> + ldrb lr, [r1, #-1]!
> + strbgt r3, [r0, #-1]!
> + strbge r4, [r0, #-1]!
> + subs r2, r2, ip
> + strb lr, [r0, #-1]!
> + blt 8b
> + ands ip, r1, #3
> + beq 1b
> +
> +10: bic r1, r1, #3
> + cmp ip, #2
> + ldr r3, [r1, #0]
> + beq 17f
> + blt 18f
> +
> +
> + .macro backward_copy_shift push pull
> +
> + subs r2, r2, #28
> + blt 14f
> +
> + CALGN( ands ip, r0, #31 )
> + CALGN( sbcsne r4, ip, r2 ) @ C is always set here
> + CALGN( subcc r2, r2, ip )
> + CALGN( bcc 15f )
> +
> +11: stmfd sp!, {r5, r6, r8 - r10}
> +
> + PLD( pld [r1, #-4] )
> + PLD( subs r2, r2, #96 )
> + PLD( pld [r1, #-32] )
> + PLD( blt 13f )
> + PLD( pld [r1, #-64] )
> + PLD( pld [r1, #-96] )
> +
> +12: PLD( pld [r1, #-128] )
> +13: ldmdb r1!, {r8, r9, r10, ip}
> + mov lr, r3, lspush #\push
> + subs r2, r2, #32
> + ldmdb r1!, {r3, r4, r5, r6}
> + orr lr, lr, ip, lspull #\pull
> + mov ip, ip, lspush #\push
> + orr ip, ip, r10, lspull #\pull
> + mov r10, r10, lspush #\push
> + orr r10, r10, r9, lspull #\pull
> + mov r9, r9, lspush #\push
> + orr r9, r9, r8, lspull #\pull
> + mov r8, r8, lspush #\push
> + orr r8, r8, r6, lspull #\pull
> + mov r6, r6, lspush #\push
> + orr r6, r6, r5, lspull #\pull
> + mov r5, r5, lspush #\push
> + orr r5, r5, r4, lspull #\pull
> + mov r4, r4, lspush #\push
> + orr r4, r4, r3, lspull #\pull
> + stmdb r0!, {r4 - r6, r8 - r10, ip, lr}
> + bge 12b
> + PLD( cmn r2, #96 )
> + PLD( bge 13b )
> +
> + ldmfd sp!, {r5, r6, r8 - r10}
> +
> +14: ands ip, r2, #28
> + beq 16f
> +
> +15: mov lr, r3, lspush #\push
> + ldr r3, [r1, #-4]!
> + subs ip, ip, #4
> + orr lr, lr, r3, lspull #\pull
> + str lr, [r0, #-4]!
> + bgt 15b
> + CALGN( cmp r2, #0 )
> + CALGN( bge 11b )
> +
> +16: add r1, r1, #(\pull / 8)
> + b 8b
> +
> + .endm
> +
> +
> + backward_copy_shift push=8 pull=24
> +
> +17: backward_copy_shift push=16 pull=16
> +
> +18: backward_copy_shift push=24 pull=8
> +
> + UNWIND( .fnend )
> +ENDPROC(memmove)
> +ENDPROC(__memmove)
> diff --git a/arch/arm/lib64/copy_template.S b/arch/arm/lib64/copy_template.S
> deleted file mode 100644
> index 8e4ff059d1..0000000000
> --- a/arch/arm/lib64/copy_template.S
> +++ /dev/null
> @@ -1,180 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/* SPDX-FileCopyrightText: 2013 ARM Ltd. */
> -/* SPDX-FileCopyrightText: 2013 Linaro */
> -
> -/*
> - * This code is based on glibc cortex strings work originally authored by Linaro
> - * and re-licensed under GPLv2 for the Linux kernel. The original code can
> - * be found @
> - *
> - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> - * files/head:/src/aarch64/
> - */
> -
> -/*
> - * Copy a buffer from src to dest (alignment handled by the hardware)
> - *
> - * Parameters:
> - * x0 - dest
> - * x1 - src
> - * x2 - n
> - * Returns:
> - * x0 - dest
> - */
> -dstin .req x0
> -src .req x1
> -count .req x2
> -tmp1 .req x3
> -tmp1w .req w3
> -tmp2 .req x4
> -tmp2w .req w4
> -dst .req x6
> -
> -A_l .req x7
> -A_h .req x8
> -B_l .req x9
> -B_h .req x10
> -C_l .req x11
> -C_h .req x12
> -D_l .req x13
> -D_h .req x14
> -
> - mov dst, dstin
> - cmp count, #16
> - /*When memory length is less than 16, the accessed are not aligned.*/
> - b.lo .Ltiny15
> -
> - neg tmp2, src
> - ands tmp2, tmp2, #15/* Bytes to reach alignment. */
> - b.eq .LSrcAligned
> - sub count, count, tmp2
> - /*
> - * Copy the leading memory data from src to dst in an increasing
> - * address order.By this way,the risk of overwritting the source
> - * memory data is eliminated when the distance between src and
> - * dst is less than 16. The memory accesses here are alignment.
> - */
> - tbz tmp2, #0, 1f
> - ldrb1 tmp1w, src, #1
> - strb1 tmp1w, dst, #1
> -1:
> - tbz tmp2, #1, 2f
> - ldrh1 tmp1w, src, #2
> - strh1 tmp1w, dst, #2
> -2:
> - tbz tmp2, #2, 3f
> - ldr1 tmp1w, src, #4
> - str1 tmp1w, dst, #4
> -3:
> - tbz tmp2, #3, .LSrcAligned
> - ldr1 tmp1, src, #8
> - str1 tmp1, dst, #8
> -
> -.LSrcAligned:
> - cmp count, #64
> - b.ge .Lcpy_over64
> - /*
> - * Deal with small copies quickly by dropping straight into the
> - * exit block.
> - */
> -.Ltail63:
> - /*
> - * Copy up to 48 bytes of data. At this point we only need the
> - * bottom 6 bits of count to be accurate.
> - */
> - ands tmp1, count, #0x30
> - b.eq .Ltiny15
> - cmp tmp1w, #0x20
> - b.eq 1f
> - b.lt 2f
> - ldp1 A_l, A_h, src, #16
> - stp1 A_l, A_h, dst, #16
> -1:
> - ldp1 A_l, A_h, src, #16
> - stp1 A_l, A_h, dst, #16
> -2:
> - ldp1 A_l, A_h, src, #16
> - stp1 A_l, A_h, dst, #16
> -.Ltiny15:
> - /*
> - * Prefer to break one ldp/stp into several load/store to access
> - * memory in an increasing address order,rather than to load/store 16
> - * bytes from (src-16) to (dst-16) and to backward the src to aligned
> - * address,which way is used in original cortex memcpy. If keeping
> - * the original memcpy process here, memmove need to satisfy the
> - * precondition that src address is at least 16 bytes bigger than dst
> - * address,otherwise some source data will be overwritten when memove
> - * call memcpy directly. To make memmove simpler and decouple the
> - * memcpy's dependency on memmove, withdrew the original process.
> - */
> - tbz count, #3, 1f
> - ldr1 tmp1, src, #8
> - str1 tmp1, dst, #8
> -1:
> - tbz count, #2, 2f
> - ldr1 tmp1w, src, #4
> - str1 tmp1w, dst, #4
> -2:
> - tbz count, #1, 3f
> - ldrh1 tmp1w, src, #2
> - strh1 tmp1w, dst, #2
> -3:
> - tbz count, #0, .Lexitfunc
> - ldrb1 tmp1w, src, #1
> - strb1 tmp1w, dst, #1
> -
> - b .Lexitfunc
> -
> -.Lcpy_over64:
> - subs count, count, #128
> - b.ge .Lcpy_body_large
> - /*
> - * Less than 128 bytes to copy, so handle 64 here and then jump
> - * to the tail.
> - */
> - ldp1 A_l, A_h, src, #16
> - stp1 A_l, A_h, dst, #16
> - ldp1 B_l, B_h, src, #16
> - ldp1 C_l, C_h, src, #16
> - stp1 B_l, B_h, dst, #16
> - stp1 C_l, C_h, dst, #16
> - ldp1 D_l, D_h, src, #16
> - stp1 D_l, D_h, dst, #16
> -
> - tst count, #0x3f
> - b.ne .Ltail63
> - b .Lexitfunc
> -
> - /*
> - * Critical loop. Start at a new cache line boundary. Assuming
> - * 64 bytes per line this ensures the entire loop is in one line.
> - */
> -.Lcpy_body_large:
> - /* pre-get 64 bytes data. */
> - ldp1 A_l, A_h, src, #16
> - ldp1 B_l, B_h, src, #16
> - ldp1 C_l, C_h, src, #16
> - ldp1 D_l, D_h, src, #16
> -1:
> - /*
> - * interlace the load of next 64 bytes data block with store of the last
> - * loaded 64 bytes data.
> - */
> - stp1 A_l, A_h, dst, #16
> - ldp1 A_l, A_h, src, #16
> - stp1 B_l, B_h, dst, #16
> - ldp1 B_l, B_h, src, #16
> - stp1 C_l, C_h, dst, #16
> - ldp1 C_l, C_h, src, #16
> - stp1 D_l, D_h, dst, #16
> - ldp1 D_l, D_h, src, #16
> - subs count, count, #64
> - b.ge 1b
> - stp1 A_l, A_h, dst, #16
> - stp1 B_l, B_h, dst, #16
> - stp1 C_l, C_h, dst, #16
> - stp1 D_l, D_h, dst, #16
> -
> - tst count, #0x3f
> - b.ne .Ltail63
> -.Lexitfunc:
> diff --git a/arch/arm/lib64/memcpy.S b/arch/arm/lib64/memcpy.S
> index 92845b25a6..98b453d3fd 100644
> --- a/arch/arm/lib64/memcpy.S
> +++ b/arch/arm/lib64/memcpy.S
> @@ -1,63 +1,249 @@
> /* SPDX-License-Identifier: GPL-2.0-only */
> -/* SPDX-FileCopyrightText: 2013 ARM Ltd. */
> -/* SPDX-FileCopyrightText: 2013 Linaro */
> -
> /*
> - * This code is based on glibc cortex strings work originally authored by Linaro
> - * and re-licensed under GPLv2 for the Linux kernel. The original code can
> - * be found @
> + * Copyright (c) 2012-2021, Arm Limited.
> *
> - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> - * files/head:/src/aarch64/
> + * Adapted from the original at:
> + * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
> */
>
> #include
> #include
>
> -/*
> - * Copy a buffer from src to dest (alignment handled by the hardware)
> +/* Assumptions:
> + *
> + * ARMv8-a, AArch64, unaligned accesses.
> *
> - * Parameters:
> - * x0 - dest
> - * x1 - src
> - * x2 - n
> - * Returns:
> - * x0 - dest
> */
> - .macro ldrb1 ptr, regB, val
> - ldrb \ptr, [\regB], \val
> - .endm
>
> - .macro strb1 ptr, regB, val
> - strb \ptr, [\regB], \val
> - .endm
> +#define L(label) .L ## label
> +
> +#define dstin x0
> +#define src x1
> +#define count x2
> +#define dst x3
> +#define srcend x4
> +#define dstend x5
> +#define A_l x6
> +#define A_lw w6
> +#define A_h x7
> +#define B_l x8
> +#define B_lw w8
> +#define B_h x9
> +#define C_l x10
> +#define C_lw w10
> +#define C_h x11
> +#define D_l x12
> +#define D_h x13
> +#define E_l x14
> +#define E_h x15
> +#define F_l x16
> +#define F_h x17
> +#define G_l count
> +#define G_h dst
> +#define H_l src
> +#define H_h srcend
> +#define tmp1 x14
> +
> +/* This implementation handles overlaps and supports both memcpy and memmove
> + from a single entry point. It uses unaligned accesses and branchless
> + sequences to keep the code small, simple and improve performance.
> +
> + Copies are split into 3 main cases: small copies of up to 32 bytes, medium
> + copies of up to 128 bytes, and large copies. The overhead of the overlap
> + check is negligible since it is only required for large copies.
> +
> + Large copies use a software pipelined loop processing 64 bytes per iteration.
> + The destination pointer is 16-byte aligned to minimize unaligned accesses.
> + The loop tail is handled by always copying 64 bytes from the end.
> +*/
> +
> +SYM_FUNC_START(__pi_memcpy)
> + add srcend, src, count
> + add dstend, dstin, count
> + cmp count, 128
> + b.hi L(copy_long)
> + cmp count, 32
> + b.hi L(copy32_128)
> +
> + /* Small copies: 0..32 bytes. */
> + cmp count, 16
> + b.lo L(copy16)
> + ldp A_l, A_h, [src]
> + ldp D_l, D_h, [srcend, -16]
> + stp A_l, A_h, [dstin]
> + stp D_l, D_h, [dstend, -16]
> + ret
> +
> + /* Copy 8-15 bytes. */
> +L(copy16):
> + tbz count, 3, L(copy8)
> + ldr A_l, [src]
> + ldr A_h, [srcend, -8]
> + str A_l, [dstin]
> + str A_h, [dstend, -8]
> + ret
> +
> + .p2align 3
> + /* Copy 4-7 bytes. */
> +L(copy8):
> + tbz count, 2, L(copy4)
> + ldr A_lw, [src]
> + ldr B_lw, [srcend, -4]
> + str A_lw, [dstin]
> + str B_lw, [dstend, -4]
> + ret
>
> - .macro ldrh1 ptr, regB, val
> - ldrh \ptr, [\regB], \val
> - .endm
> + /* Copy 0..3 bytes using a branchless sequence. */
> +L(copy4):
> + cbz count, L(copy0)
> + lsr tmp1, count, 1
> + ldrb A_lw, [src]
> + ldrb C_lw, [srcend, -1]
> + ldrb B_lw, [src, tmp1]
> + strb A_lw, [dstin]
> + strb B_lw, [dstin, tmp1]
> + strb C_lw, [dstend, -1]
> +L(copy0):
> + ret
>
> - .macro strh1 ptr, regB, val
> - strh \ptr, [\regB], \val
> - .endm
> + .p2align 4
> + /* Medium copies: 33..128 bytes. */
> +L(copy32_128):
> + ldp A_l, A_h, [src]
> + ldp B_l, B_h, [src, 16]
> + ldp C_l, C_h, [srcend, -32]
> + ldp D_l, D_h, [srcend, -16]
> + cmp count, 64
> + b.hi L(copy128)
> + stp A_l, A_h, [dstin]
> + stp B_l, B_h, [dstin, 16]
> + stp C_l, C_h, [dstend, -32]
> + stp D_l, D_h, [dstend, -16]
> + ret
>
> - .macro ldr1 ptr, regB, val
> - ldr \ptr, [\regB], \val
> - .endm
> + .p2align 4
> + /* Copy 65..128 bytes. */
> +L(copy128):
> + ldp E_l, E_h, [src, 32]
> + ldp F_l, F_h, [src, 48]
> + cmp count, 96
> + b.ls L(copy96)
> + ldp G_l, G_h, [srcend, -64]
> + ldp H_l, H_h, [srcend, -48]
> + stp G_l, G_h, [dstend, -64]
> + stp H_l, H_h, [dstend, -48]
> +L(copy96):
> + stp A_l, A_h, [dstin]
> + stp B_l, B_h, [dstin, 16]
> + stp E_l, E_h, [dstin, 32]
> + stp F_l, F_h, [dstin, 48]
> + stp C_l, C_h, [dstend, -32]
> + stp D_l, D_h, [dstend, -16]
> + ret
>
> - .macro str1 ptr, regB, val
> - str \ptr, [\regB], \val
> - .endm
> + .p2align 4
> + /* Copy more than 128 bytes. */
> +L(copy_long):
> + /* Use backwards copy if there is an overlap. */
> + sub tmp1, dstin, src
> + cbz tmp1, L(copy0)
> + cmp tmp1, count
> + b.lo L(copy_long_backwards)
>
> - .macro ldp1 ptr, regB, regC, val
> - ldp \ptr, \regB, [\regC], \val
> - .endm
> + /* Copy 16 bytes and then align dst to 16-byte alignment. */
>
> - .macro stp1 ptr, regB, regC, val
> - stp \ptr, \regB, [\regC], \val
> - .endm
> + ldp D_l, D_h, [src]
> + and tmp1, dstin, 15
> + bic dst, dstin, 15
> + sub src, src, tmp1
> + add count, count, tmp1 /* Count is now 16 too large. */
> + ldp A_l, A_h, [src, 16]
> + stp D_l, D_h, [dstin]
> + ldp B_l, B_h, [src, 32]
> + ldp C_l, C_h, [src, 48]
> + ldp D_l, D_h, [src, 64]!
> + subs count, count, 128 + 16 /* Test and readjust count. */
> + b.ls L(copy64_from_end)
>
> - .weak __arch_memcpy
> -ENTRY(__arch_memcpy)
> -#include "copy_template.S"
> +L(loop64):
> + stp A_l, A_h, [dst, 16]
> + ldp A_l, A_h, [src, 16]
> + stp B_l, B_h, [dst, 32]
> + ldp B_l, B_h, [src, 32]
> + stp C_l, C_h, [dst, 48]
> + ldp C_l, C_h, [src, 48]
> + stp D_l, D_h, [dst, 64]!
> + ldp D_l, D_h, [src, 64]!
> + subs count, count, 64
> + b.hi L(loop64)
> +
> + /* Write the last iteration and copy 64 bytes from the end. */
> +L(copy64_from_end):
> + ldp E_l, E_h, [srcend, -64]
> + stp A_l, A_h, [dst, 16]
> + ldp A_l, A_h, [srcend, -48]
> + stp B_l, B_h, [dst, 32]
> + ldp B_l, B_h, [srcend, -32]
> + stp C_l, C_h, [dst, 48]
> + ldp C_l, C_h, [srcend, -16]
> + stp D_l, D_h, [dst, 64]
> + stp E_l, E_h, [dstend, -64]
> + stp A_l, A_h, [dstend, -48]
> + stp B_l, B_h, [dstend, -32]
> + stp C_l, C_h, [dstend, -16]
> ret
> -ENDPROC(__arch_memcpy)
> +
> + .p2align 4
> +
> + /* Large backwards copy for overlapping copies.
> + Copy 16 bytes and then align dst to 16-byte alignment. */
> +L(copy_long_backwards):
> + ldp D_l, D_h, [srcend, -16]
> + and tmp1, dstend, 15
> + sub srcend, srcend, tmp1
> + sub count, count, tmp1
> + ldp A_l, A_h, [srcend, -16]
> + stp D_l, D_h, [dstend, -16]
> + ldp B_l, B_h, [srcend, -32]
> + ldp C_l, C_h, [srcend, -48]
> + ldp D_l, D_h, [srcend, -64]!
> + sub dstend, dstend, tmp1
> + subs count, count, 128
> + b.ls L(copy64_from_start)
> +
> +L(loop64_backwards):
> + stp A_l, A_h, [dstend, -16]
> + ldp A_l, A_h, [srcend, -16]
> + stp B_l, B_h, [dstend, -32]
> + ldp B_l, B_h, [srcend, -32]
> + stp C_l, C_h, [dstend, -48]
> + ldp C_l, C_h, [srcend, -48]
> + stp D_l, D_h, [dstend, -64]!
> + ldp D_l, D_h, [srcend, -64]!
> + subs count, count, 64
> + b.hi L(loop64_backwards)
> +
> + /* Write the last iteration and copy 64 bytes from the start. */
> +L(copy64_from_start):
> + ldp G_l, G_h, [src, 48]
> + stp A_l, A_h, [dstend, -16]
> + ldp A_l, A_h, [src, 32]
> + stp B_l, B_h, [dstend, -32]
> + ldp B_l, B_h, [src, 16]
> + stp C_l, C_h, [dstend, -48]
> + ldp C_l, C_h, [src]
> + stp D_l, D_h, [dstend, -64]
> + stp G_l, G_h, [dstin, 48]
> + stp A_l, A_h, [dstin, 32]
> + stp B_l, B_h, [dstin, 16]
> + stp C_l, C_h, [dstin]
> + ret
> +SYM_FUNC_END(__pi_memcpy)
> +
> +SYM_FUNC_ALIAS(__arch_memcpy, __pi_memcpy)
> +SYM_FUNC_ALIAS_WEAK(memcpy, __memcpy)
> +
> +SYM_FUNC_ALIAS(__pi_memmove, __pi_memcpy)
> +
> +SYM_FUNC_ALIAS(__arch_memmove, __pi_memmove)
> +SYM_FUNC_ALIAS_WEAK(memmove, __memmove)
> diff --git a/arch/arm/lib64/memset.S b/arch/arm/lib64/memset.S
> index ff201750f1..f059203983 100644
> --- a/arch/arm/lib64/memset.S
> +++ b/arch/arm/lib64/memset.S
> @@ -1,10 +1,9 @@
> /* SPDX-License-Identifier: GPL-2.0-only */
> -/* SPDX-FileCopyrightText: 2013 ARM Ltd. */
> -/* SPDX-FileCopyrightText: 2013 Linaro */
> -
> /*
> + * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> * This code is based on glibc cortex strings work originally authored by Linaro
> - * and re-licensed under GPLv2 for the Linux kernel. The original code can
> * be found @
> *
> * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> @@ -13,6 +12,7 @@
>
> #include
> #include
> +#include
>
> /*
> * Fill in the buffer with character c (alignment handled by the hardware)
> @@ -42,8 +42,7 @@ dst .req x8
> tmp3w .req w9
> tmp3 .req x9
>
> - .weak memset
> -ENTRY(__arch_memset)
> +SYM_FUNC_START(__pi_memset)
> mov dst, dstin /* Preserve return value. */
> and A_lw, val, #255
> orr A_lw, A_lw, A_lw, lsl #8
> @@ -115,6 +114,7 @@ ENTRY(__arch_memset)
> * Critical loop. Start at a new cache line boundary. Assuming
> * 64 bytes per line, this ensures the entire loop is in one line.
> */
> + .p2align L1_CACHE_SHIFT
> .Lnot_short:
> sub dst, dst, #16/* Pre-bias. */
> sub count, count, #64
> @@ -201,4 +201,8 @@ ENTRY(__arch_memset)
> ands count, count, zva_bits_x
> b.ne .Ltail_maybe_long
> ret
> -ENDPROC(__arch_memset)
> +SYM_FUNC_END(__pi_memset)
> +
> +SYM_FUNC_ALIAS(__arch_memset, __pi_memset)
> +
> +SYM_FUNC_ALIAS_WEAK(memset, __pi_memset)
> diff --git a/arch/arm/lib64/string.c b/arch/arm/lib64/string.c
> index 938790e1a9..c7954d6efe 100644
> --- a/arch/arm/lib64/string.c
> +++ b/arch/arm/lib64/string.c
> @@ -6,6 +6,7 @@
>
> void *__arch_memset(void *dst, int c, __kernel_size_t size);
> void *__arch_memcpy(void * dest, const void *src, size_t count);
> +void *__arch_memmove(void * dest, const void *src, size_t count);
>
> static __prereloc void *_memset(void *dst, int c, __kernel_size_t size)
> {
> @@ -38,3 +39,19 @@ void __weak *memcpy(void * dest, const void *src, size_t count)
>
> void *__memcpy(void * dest, const void *src, size_t count)
> __alias(_memcpy);
> +
> +static void *_memmove(void * dest, const void *src, size_t count)
> +{
> + if (likely(get_cr() & CR_M))
> + return __arch_memmove(dest, src, count);
> +
> + return __default_memmove(dest, src, count);
> +}
> +
> +void __weak *memmove(void * dest, const void *src, size_t count)
> +{
> + return _memmove(dest, src, count);
> +}
> +
> +void *__memmove(void * dest, const void *src, size_t count)
> + __alias(_memmove);
> diff --git a/include/string.h b/include/string.h
> index cbe6eddf7f..986ccd83dd 100644
> --- a/include/string.h
> +++ b/include/string.h
> @@ -17,6 +17,8 @@ void *__nokasan_default_memset(void *, int, __kernel_size_t);
> void *__default_memcpy(void * dest,const void *src,size_t count);
> void *__nokasan_default_memcpy(void * dest,const void *src,size_t count);
> +
> +void *__default_memmove(void * dest,const void *src,size_t count);
> +
> char *parse_assignment(char *str);
>
> int strverscmp(const char *a, const char *b);
> diff --git a/lib/string.c b/lib/string.c
> index 98dd3cffdd..50c2016c2b 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -701,7 +701,6 @@ void *memmove(void * dest, const void *src, size_t count)
> void *__memmove(void * dest, const void *src, size_t count)
> __alias(__default_memmove);
> #endif
> -EXPORT_SYMBOL(memmove);
>
> #ifndef __HAVE_ARCH_MEMCMP
> /**
>
> --
> 2.39.5
>
>
>
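
P.S.: for reference, one rough way to get such numbers from the barebox shell would be a small throughput test along the lines of the untested sketch below. It is only an illustration, not part of the patch: the command name, buffer size and iteration count are made up, and it assumes barebox's get_time_ns() clock helper and the BAREBOX_CMD registration macros.

#include <common.h>
#include <command.h>
#include <clock.h>
#include <malloc.h>
#include <string.h>

/* Crude memmove throughput test: repeatedly move an overlapping 1 MiB
 * window (dest < src, 64 byte offset) and report the elapsed time. */
static int do_memmove_bench(int argc, char *argv[])
{
	const size_t size = 1024 * 1024;
	const int iterations = 256;
	u8 *buf = malloc(size + 64);
	u64 start, delta;
	int i;

	if (!buf)
		return COMMAND_ERROR;

	start = get_time_ns();
	for (i = 0; i < iterations; i++)
		memmove(buf, buf + 64, size);	/* overlapping copy */
	delta = get_time_ns() - start;

	printf("memmove: %d x %zu bytes in %llu ns\n",
	       iterations, size, (unsigned long long)delta);

	free(buf);
	return 0;
}

BAREBOX_CMD_START(memmove_bench)
	.cmd = do_memmove_bench,
	BAREBOX_CMD_DESC("rough memmove throughput test")
BAREBOX_CMD_END

Running something like this before and after the series (or with the optimized string functions disabled) would give a first-order comparison.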