From patchwork Sat Jun 7 12:39:25 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gyorgy Sarvari
X-Patchwork-Id: 64500
From: Gyorgy Sarvari
To: openembedded-core@lists.openembedded.org
Subject: [PATCH v2] libtheora: upgrade 1.1.1 -> 1.2.0
Date: Sat, 7 Jun 2025 14:39:25 +0200
Message-ID: <20250607123925.452937-1-skandigraun@gmail.com>
X-Mailer: git-send-email 2.49.0
X-Groupsio-URL: https://lists.openembedded.org/g/openembedded-core/message/218195

Drop no-docs.patch, and use "--disable-doc" configuration instead.
Drop autoreconf.patch, because it is included in the release.

Add 0001-add-missing-files.patch to mitigate a release issue, which
caused some files to be missing from the tarball.

Major changes:
- New 'ptalarbvorm' encoder
- New th_encode_ctl option for copying configuration from an existing
  setup header, useful for splicing streams.
- Added support for RISC OS.
- Improved ARM support.
- Various speed, bug fixes and code quality improvements.

See CHANGES file for full changelog.

Signed-off-by: Gyorgy Sarvari
---
Changes in v2:
- Add a patch to mitigate a release bug: https://gitlab.xiph.org/xiph/theora/-/issues/2338
  Some 32-bit arm related files are missing from the tarball, causing the
  build to fail on this arch (when assembly optimization is also enabled,
  which is the default).
- Remove no-docs.patch, and instead add "--disable-doc" configuration
  option, which is a new option that was introduced in this release.

Link to v1: https://lore.kernel.org/openembedded-core/20250606114337.611299-1-skandigraun@gmail.com

 .../libtheora-1.1.1/autoreconf.patch          |  42 -
 .../libtheora/libtheora-1.1.1/no-docs.patch   |  15 -
 .../libtheora/0001-add-missing-files.patch    | 769 ++++++++++++++++++
 ...{libtheora_1.1.1.bb => libtheora_1.2.0.bb} |   9 +-
 4 files changed, 773 insertions(+), 62 deletions(-)
 delete mode 100644 meta/recipes-multimedia/libtheora/libtheora-1.1.1/autoreconf.patch
 delete mode 100644 meta/recipes-multimedia/libtheora/libtheora-1.1.1/no-docs.patch
 create mode 100644 meta/recipes-multimedia/libtheora/libtheora/0001-add-missing-files.patch
 rename meta/recipes-multimedia/libtheora/{libtheora_1.1.1.bb => libtheora_1.2.0.bb} (71%)

diff --git a/meta/recipes-multimedia/libtheora/libtheora-1.1.1/autoreconf.patch b/meta/recipes-multimedia/libtheora/libtheora-1.1.1/autoreconf.patch
deleted file mode 100644
index 0fc09ba413..0000000000
--- a/meta/recipes-multimedia/libtheora/libtheora-1.1.1/autoreconf.patch
+++ /dev/null
@@ -1,42 +0,0 @@
-From 859e58b440e64aeec446ae0a923a638e4203f408 Mon Sep 17 00:00:00 2001
-From: Tim Terriberry
-Date: Fri, 20 May 2011 20:41:50 +0000
-Subject: [PATCH] Make autoreconf -i -f work.
-
-Patch from David Schleef.
-
-svn path=/trunk/theora/; revision=17990
-
-Upstream-Status: Backport
-Signed-off-by: Ross Burton
----
- Makefile.am  | 2 ++
- configure.ac | 2 +-
- 2 files changed, 3 insertions(+), 1 deletion(-)
-
-diff --git a/Makefile.am b/Makefile.am
-index d833491..369978d 100644
---- a/Makefile.am
-+++ b/Makefile.am
-@@ -2,6 +2,8 @@
- 
- AUTOMAKE_OPTIONS = foreign 1.6 dist-zip dist-bzip2
- 
-+ACLOCAL_AMFLAGS=-I m4
-+
- if THEORA_ENABLE_EXAMPLES
- EXAMPLES_DIR = examples
- else
-diff --git a/configure.ac b/configure.ac
-index 8260bdf..d4feb86 100644
---- a/configure.ac
-+++ b/configure.ac
-@@ -61,7 +61,7 @@ AC_LIBTOOL_WIN32_DLL
- AM_PROG_LIBTOOL
- 
- dnl Add parameters for aclocal
--AC_SUBST(ACLOCAL_AMFLAGS, "-I m4")
-+AC_CONFIG_MACRO_DIR([m4])
- 
- dnl Check for doxygen
- AC_CHECK_PROG(HAVE_DOXYGEN, doxygen, true, false)
diff --git a/meta/recipes-multimedia/libtheora/libtheora-1.1.1/no-docs.patch b/meta/recipes-multimedia/libtheora/libtheora-1.1.1/no-docs.patch
deleted file mode 100644
index 359f3d1a7a..0000000000
--- a/meta/recipes-multimedia/libtheora/libtheora-1.1.1/no-docs.patch
+++ /dev/null
@@ -1,15 +0,0 @@
-Upstream-Status: Inappropriate [configuration]
-
-Index: libtheora-1.1.1/Makefile.am
-===================================================================
---- libtheora-1.1.1.orig/Makefile.am 2009-11-25 22:01:53.593775926 +0100
-+++ libtheora-1.1.1/Makefile.am 2009-11-25 22:02:00.777524017 +0100
-@@ -8,7 +8,7 @@
- EXAMPLES_DIR =
- endif
- 
--SUBDIRS = lib include doc tests m4 $(EXAMPLES_DIR)
-+SUBDIRS = lib include tests m4 $(EXAMPLES_DIR)
- 
- 
- # we include the whole debian/ dir in EXTRA_DIST because there's a problem
diff --git a/meta/recipes-multimedia/libtheora/libtheora/0001-add-missing-files.patch
b/meta/recipes-multimedia/libtheora/libtheora/0001-add-missing-files.patch new file mode 100644 index 0000000000..323ac7da83 --- /dev/null +++ b/meta/recipes-multimedia/libtheora/libtheora/0001-add-missing-files.patch @@ -0,0 +1,769 @@ +From 0880595f9b08d15da0e72cefaf24841cbb930883 Mon Sep 17 00:00:00 2001 +From: Gyorgy Sarvari +Date: Sat, 7 Jun 2025 14:10:40 +0200 +Subject: [PATCH] add missing files + +Due to a release issue, two files were not added to the libtheora 1.2.0 +release tarball - these files are required to be able to build the +library for 32-bit ARM systems along with assembly optimization. + +This patch adds these files. + +This is not a code issue per-se, rather a tarballing one, as the files +are present in the source code repository. + +Upstream-Status: Backport [https://gitlab.xiph.org/xiph/theora/-/issues/2338] + +Signed-off-by: Gyorgy Sarvari +--- + lib/arm/armenc.c | 57 ++++ + lib/arm/armloop.s | 676 ++++++++++++++++++++++++++++++++++++++++++++++ + 2 files changed, 733 insertions(+) + create mode 100644 lib/arm/armenc.c + create mode 100644 lib/arm/armloop.s + +diff --git a/lib/arm/armenc.c b/lib/arm/armenc.c +new file mode 100644 +index 0000000..4cfb8a7 +--- /dev/null ++++ b/lib/arm/armenc.c +@@ -0,0 +1,57 @@ ++/******************************************************************** ++ * * ++ * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. * ++ * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS * ++ * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE * ++ * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. 
* ++ * * ++ * THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2010 * ++ * by the Xiph.Org Foundation and contributors * ++ * https://www.xiph.org/ * ++ * * ++ ******************************************************************** ++ ++ function: ++ ++ ********************************************************************/ ++#include "armenc.h" ++ ++#if defined(OC_ARM_ASM) ++ ++void oc_enc_accel_init_arm(oc_enc_ctx *_enc){ ++ ogg_uint32_t cpu_flags; ++ cpu_flags=_enc->state.cpu_flags; ++ oc_enc_accel_init_c(_enc); ++# if defined(OC_ENC_USE_VTABLE) ++ /*TODO: Add ARMv4 functions here.*/ ++# endif ++# if defined(OC_ARM_ASM_EDSP) ++ if(cpu_flags&OC_CPU_ARM_EDSP){ ++# if defined(OC_STATE_USE_VTABLE) ++ /*TODO: Add EDSP functions here.*/ ++# endif ++ } ++# if defined(OC_ARM_ASM_MEDIA) ++ if(cpu_flags&OC_CPU_ARM_MEDIA){ ++# if defined(OC_STATE_USE_VTABLE) ++ /*TODO: Add Media functions here.*/ ++# endif ++ } ++# if defined(OC_ARM_ASM_NEON) ++ if(cpu_flags&OC_CPU_ARM_NEON){ ++# if defined(OC_STATE_USE_VTABLE) ++ _enc->opt_vtable.frag_satd=oc_enc_frag_satd_neon; ++ _enc->opt_vtable.frag_satd2=oc_enc_frag_satd2_neon; ++ _enc->opt_vtable.frag_intra_satd=oc_enc_frag_intra_satd_neon; ++ _enc->opt_vtable.enquant_table_init=oc_enc_enquant_table_init_neon; ++ _enc->opt_vtable.enquant_table_fixup=oc_enc_enquant_table_fixup_neon; ++ _enc->opt_vtable.quantize=oc_enc_quantize_neon; ++# endif ++ _enc->opt_data.enquant_table_size=128*sizeof(ogg_uint16_t); ++ _enc->opt_data.enquant_table_alignment=16; ++ } ++# endif ++# endif ++# endif ++} ++#endif +diff --git a/lib/arm/armloop.s b/lib/arm/armloop.s +new file mode 100644 +index 0000000..c35da0f +--- /dev/null ++++ b/lib/arm/armloop.s +@@ -0,0 +1,676 @@ ++;******************************************************************** ++;* * ++;* THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. 
* ++;* USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS * ++;* GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE * ++;* IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. * ++;* * ++;* THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2010 * ++;* by the Xiph.Org Foundation and contributors * ++;* https://www.xiph.org/ * ++;* * ++;******************************************************************** ++; Original implementation: ++; Copyright (C) 2009 Robin Watts for Pinknoise Productions Ltd ++;******************************************************************** ++ ++ AREA |.text|, CODE, READONLY ++ ++ GET armopts.s ++ ++ EXPORT oc_loop_filter_frag_rows_arm ++ ++; Which bit this is depends on the order of packing within a bitfield. ++; Hopefully that doesn't change among any of the relevant compilers. ++OC_FRAG_CODED_FLAG * 1 ++ ++ ; Vanilla ARM v4 version ++loop_filter_h_arm PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int *_bv ++ ; preserves r0-r3 ++ STMFD r13!,{r3-r6,r14} ++ MOV r14,#8 ++ MOV r6, #255 ++lfh_arm_lp ++ LDRB r3, [r0, #-2] ; r3 = _pix[0] ++ LDRB r12,[r0, #1] ; r12= _pix[3] ++ LDRB r4, [r0, #-1] ; r4 = _pix[1] ++ LDRB r5, [r0] ; r5 = _pix[2] ++ SUB r3, r3, r12 ; r3 = _pix[0]-_pix[3]+4 ++ ADD r3, r3, #4 ++ SUB r12,r5, r4 ; r12= _pix[2]-_pix[1] ++ ADD r12,r12,r12,LSL #1 ; r12= 3*(_pix[2]-_pix[1]) ++ ADD r12,r12,r3 ; r12= _pix[0]-_pix[3]+3*(_pix[2]-_pix[1])+4 ++ MOV r12,r12,ASR #3 ++ LDRSB r12,[r2, r12] ++ ; Stall (2 on Xscale) ++ ADDS r4, r4, r12 ++ CMPGT r6, r4 ++ EORLT r4, r6, r4, ASR #32 ++ SUBS r5, r5, r12 ++ CMPGT r6, r5 ++ EORLT r5, r6, r5, ASR #32 ++ STRB r4, [r0, #-1] ++ STRB r5, [r0], r1 ++ SUBS r14,r14,#1 ++ BGT lfh_arm_lp ++ SUB r0, r0, r1, LSL #3 ++ LDMFD r13!,{r3-r6,PC} ++ ENDP ++ ++loop_filter_v_arm PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int *_bv ++ ; preserves r0-r3 ++ STMFD r13!,{r3-r6,r14} ++ MOV r14,#8 ++ MOV r6, #255 ++lfv_arm_lp ++ LDRB r3, [r0, -r1, LSL 
#1] ; r3 = _pix[0] ++ LDRB r12,[r0, r1] ; r12= _pix[3] ++ LDRB r4, [r0, -r1] ; r4 = _pix[1] ++ LDRB r5, [r0] ; r5 = _pix[2] ++ SUB r3, r3, r12 ; r3 = _pix[0]-_pix[3]+4 ++ ADD r3, r3, #4 ++ SUB r12,r5, r4 ; r12= _pix[2]-_pix[1] ++ ADD r12,r12,r12,LSL #1 ; r12= 3*(_pix[2]-_pix[1]) ++ ADD r12,r12,r3 ; r12= _pix[0]-_pix[3]+3*(_pix[2]-_pix[1])+4 ++ MOV r12,r12,ASR #3 ++ LDRSB r12,[r2, r12] ++ ; Stall (2 on Xscale) ++ ADDS r4, r4, r12 ++ CMPGT r6, r4 ++ EORLT r4, r6, r4, ASR #32 ++ SUBS r5, r5, r12 ++ CMPGT r6, r5 ++ EORLT r5, r6, r5, ASR #32 ++ STRB r4, [r0, -r1] ++ STRB r5, [r0], #1 ++ SUBS r14,r14,#1 ++ BGT lfv_arm_lp ++ SUB r0, r0, #8 ++ LDMFD r13!,{r3-r6,PC} ++ ENDP ++ ++oc_loop_filter_frag_rows_arm PROC ++ ; r0 = _ref_frame_data ++ ; r1 = _ystride ++ ; r2 = _bv ++ ; r3 = _frags ++ ; r4 = _fragi0 ++ ; r5 = _fragi0_end ++ ; r6 = _fragi_top ++ ; r7 = _fragi_bot ++ ; r8 = _frag_buf_offs ++ ; r9 = _nhfrags ++ MOV r12,r13 ++ STMFD r13!,{r0,r4-r11,r14} ++ LDMFD r12,{r4-r9} ++ ADD r2, r2, #127 ; _bv += 127 ++ CMP r4, r5 ; if(_fragi0>=_fragi0_end) ++ BGE oslffri_arm_end ; bail ++ SUBS r9, r9, #1 ; r9 = _nhfrags-1 if (r9<=0) ++ BLE oslffri_arm_end ; bail ++ ADD r3, r3, r4, LSL #2 ; r3 = &_frags[fragi] ++ ADD r8, r8, r4, LSL #2 ; r8 = &_frag_buf_offs[fragi] ++ SUB r7, r7, r9 ; _fragi_bot -= _nhfrags; ++oslffri_arm_lp1 ++ MOV r10,r4 ; r10= fragi = _fragi0 ++ ADD r11,r4, r9 ; r11= fragi_end-1=fragi+_nhfrags-1 ++oslffri_arm_lp2 ++ LDR r14,[r3], #4 ; r14= _frags[fragi] _frags++ ++ LDR r0, [r13] ; r0 = _ref_frame_data ++ LDR r12,[r8], #4 ; r12= _frag_buf_offs[fragi] _frag_buf_offs++ ++ TST r14,#OC_FRAG_CODED_FLAG ++ BEQ oslffri_arm_uncoded ++ CMP r10,r4 ; if (fragi>_fragi0) ++ ADD r0, r0, r12 ; r0 = _ref_frame_data + _frag_buf_offs[fragi] ++ BLGT loop_filter_h_arm ++ CMP r4, r6 ; if (_fragi0>_fragi_top) ++ BLGT loop_filter_v_arm ++ CMP r10,r11 ; if(fragi+1 ++ AND r1, r1, #255 ; r1 = ll=r1&0xFF ++ ORR r1, r1, r1, LSL #8 ; r1 = ++ PKHBT r1, r1, r1, LSL #16 ; r1 = ++ STR r1, [r0] ++ 
MOV PC,r14 ++ ENDP ++ ++; We could use the same strategy as the v filter below, but that would require ++; 40 instructions to load the data and transpose it into columns and another ++; 32 to write out the results at the end, plus the 52 instructions to do the ++; filtering itself. ++; This is slightly less, and less code, even assuming we could have shared the ++; 52 instructions in the middle with the other function. ++; It executes slightly fewer instructions than the ARMv6 approach David Conrad ++; proposed for FFmpeg, but not by much: ++; http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2010-February/083141.html ++; His is a lot less code, though, because it only does two rows at once instead ++; of four. ++loop_filter_h_v6 PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int _ll ++ ; preserves r0-r3 ++ STMFD r13!,{r4-r11,r14} ++ LDR r12,=0x10003 ++ BL loop_filter_h_core_v6 ++ ADD r0, r0, r1, LSL #2 ++ BL loop_filter_h_core_v6 ++ SUB r0, r0, r1, LSL #2 ++ LDMFD r13!,{r4-r11,PC} ++ ENDP ++ ++loop_filter_h_core_v6 PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int _ll ++ ; r12= 0x10003 ++ ; Preserves r0-r3, r12; Clobbers r4-r11. ++ LDR r4,[r0, #-2]! ; r4 = ++ ; Single issue ++ LDR r5,[r0, r1]! ; r5 = ++ UXTB16 r6, r4, ROR #16 ; r6 = ++ UXTB16 r4, r4, ROR #8 ; r4 = ++ UXTB16 r7, r5, ROR #16 ; r7 = ++ UXTB16 r5, r5, ROR #8 ; r5 = ++ PKHBT r8, r4, r5, LSL #16 ; r8 = <__|q1|__|p1> ++ PKHBT r9, r6, r7, LSL #16 ; r9 = <__|q2|__|p2> ++ SSUB16 r6, r4, r6 ; r6 = ++ SMLAD r6, r6, r12,r12 ; r6 = ++ SSUB16 r7, r5, r7 ; r7 = ++ SMLAD r7, r7, r12,r12 ; r7 = ++ LDR r4,[r0, r1]! ; r4 = ++ MOV r6, r6, ASR #3 ; r6 = >3> ++ LDR r5,[r0, r1]! 
; r5 = ++ PKHBT r11,r6, r7, LSL #13 ; r11= ++ UXTB16 r6, r4, ROR #16 ; r6 = ++ UXTB16 r11,r11 ; r11= <__|-R_q|__|-R_p> ++ UXTB16 r4, r4, ROR #8 ; r4 = ++ UXTB16 r7, r5, ROR #16 ; r7 = ++ PKHBT r10,r6, r7, LSL #16 ; r10= <__|s2|__|r2> ++ SSUB16 r6, r4, r6 ; r6 = ++ UXTB16 r5, r5, ROR #8 ; r5 = ++ SMLAD r6, r6, r12,r12 ; r6 = ++ SSUB16 r7, r5, r7 ; r7 = ++ SMLAD r7, r7, r12,r12 ; r7 = ++ ORR r9, r9, r10, LSL #8 ; r9 = ++ MOV r6, r6, ASR #3 ; r6 = >3> ++ PKHBT r10,r4, r5, LSL #16 ; r10= <__|s1|__|r1> ++ PKHBT r6, r6, r7, LSL #13 ; r6 = ++ ORR r8, r8, r10, LSL #8 ; r8 = ++ UXTB16 r6, r6 ; r6 = <__|-R_s|__|-R_r> ++ MOV r10,#0 ++ ORR r6, r11,r6, LSL #8 ; r6 = <-R_s|-R_q|-R_r|-R_p> ++ ; Single issue ++ ; There's no min, max or abs instruction. ++ ; SSUB8 and SEL will work for abs, and we can do all the rest with ++ ; unsigned saturated adds, which means the GE flags are still all ++ ; set when we're done computing lflim(abs(R_i),L). ++ ; This allows us to both add and subtract, and split the results by ++ ; the original sign of R_i. ++ SSUB8 r7, r10,r6 ++ ; Single issue ++ SEL r7, r7, r6 ; r7 = abs(R_i) ++ ; Single issue ++ UQADD8 r4, r7, r2 ; r4 = 255-max(2*L-abs(R_i),0) ++ ; Single issue ++ UQADD8 r7, r7, r4 ++ ; Single issue ++ UQSUB8 r7, r7, r4 ; r7 = min(abs(R_i),max(2*L-abs(R_i),0)) ++ ; Single issue ++ UQSUB8 r4, r8, r7 ++ UQADD8 r5, r9, r7 ++ UQADD8 r8, r8, r7 ++ UQSUB8 r9, r9, r7 ++ SEL r8, r8, r4 ; r8 = p1+lflim(R_i,L) ++ SEL r9, r9, r5 ; r9 = p2-lflim(R_i,L) ++ MOV r5, r9, LSR #24 ; r5 = s2 ++ STRB r5, [r0,#2]! ++ MOV r4, r8, LSR #24 ; r4 = s1 ++ STRB r4, [r0,#-1] ++ MOV r5, r9, LSR #8 ; r5 = r2 ++ STRB r5, [r0,-r1]! ++ MOV r4, r8, LSR #8 ; r4 = r1 ++ STRB r4, [r0,#-1] ++ MOV r5, r9, LSR #16 ; r5 = q2 ++ STRB r5, [r0,-r1]! ++ MOV r4, r8, LSR #16 ; r4 = q1 ++ STRB r4, [r0,#-1] ++ ; Single issue ++ STRB r9, [r0,-r1]! 
++ ; Single issue ++ STRB r8, [r0,#-1] ++ MOV PC,r14 ++ ENDP ++ ++; This uses the same strategy as the MMXEXT version for x86, except that UHADD8 ++; computes (a+b>>1) instead of (a+b+1>>1) like PAVGB. ++; This works just as well, with the following procedure for computing the ++; filter value, f: ++; u = ~UHADD8(p1,~p2); ++; v = UHADD8(~p1,p2); ++; m = v-u; ++; a = m^UHADD8(m^p0,m^~p3); ++; f = UHADD8(UHADD8(a,u1),v1); ++; where f = 127+R, with R in [-127,128] defined as in the spec. ++; This is exactly the same amount of arithmetic as the version that uses PAVGB ++; as the basic operator. ++; It executes about 2/3 the number of instructions of David Conrad's approach, ++; but requires more code, because it does all eight columns at once, instead ++; of four at a time. ++loop_filter_v_v6 PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int _ll ++ ; preserves r0-r11 ++ STMFD r13!,{r4-r11,r14} ++ LDRD r6, [r0, -r1]! ; r7, r6 = ++ LDRD r4, [r0, -r1] ; r5, r4 = ++ LDRD r8, [r0, r1]! ; r9, r8 = ++ MVN r14,r6 ; r14= ~p1 ++ LDRD r10,[r0, r1] ; r11,r10= ++ ; Filter the first four columns. ++ MVN r12,r8 ; r12= ~p2 ++ UHADD8 r14,r14,r8 ; r14= v1=~p1+p2>>1 ++ UHADD8 r12,r12,r6 ; r12= p1+~p2>>1 ++ MVN r10, r10 ; r10=~p3 ++ MVN r12,r12 ; r12= u1=~p1+p2+1>>1 ++ SSUB8 r14,r14,r12 ; r14= m1=v1-u1 ++ ; Single issue ++ EOR r4, r4, r14 ; r4 = m1^p0 ++ EOR r10,r10,r14 ; r10= m1^~p3 ++ UHADD8 r4, r4, r10 ; r4 = (m1^p0)+(m1^~p3)>>1 ++ ; Single issue ++ EOR r4, r4, r14 ; r4 = a1=m1^((m1^p0)+(m1^~p3)>>1) ++ SADD8 r14,r14,r12 ; r14= v1=m1+u1 ++ UHADD8 r4, r4, r12 ; r4 = a1+u1>>1 ++ MVN r12,r9 ; r12= ~p6 ++ UHADD8 r4, r4, r14 ; r4 = f1=(a1+u1>>1)+v1>>1 ++ ; Filter the second four columns. 
++ MVN r14,r7 ; r14= ~p5 ++ UHADD8 r12,r12,r7 ; r12= p5+~p6>>1 ++ UHADD8 r14,r14,r9 ; r14= v2=~p5+p6>>1 ++ MVN r12,r12 ; r12= u2=~p5+p6+1>>1 ++ MVN r11,r11 ; r11=~p7 ++ SSUB8 r10,r14,r12 ; r10= m2=v2-u2 ++ ; Single issue ++ EOR r5, r5, r10 ; r5 = m2^p4 ++ EOR r11,r11,r10 ; r11= m2^~p7 ++ UHADD8 r5, r5, r11 ; r5 = (m2^p4)+(m2^~p7)>>1 ++ ; Single issue ++ EOR r5, r5, r10 ; r5 = a2=m2^((m2^p4)+(m2^~p7)>>1) ++ ; Single issue ++ UHADD8 r5, r5, r12 ; r5 = a2+u2>>1 ++ LDR r12,=0x7F7F7F7F ; r12 = {127}x4 ++ UHADD8 r5, r5, r14 ; r5 = f2=(a2+u2>>1)+v2>>1 ++ ; Now split f[i] by sign. ++ ; There's no min or max instruction. ++ ; We could use SSUB8 and SEL, but this is just as many instructions and ++ ; dual issues more (for v7 without NEON). ++ UQSUB8 r10,r4, r12 ; r10= R_i>0?R_i:0 ++ UQSUB8 r4, r12,r4 ; r4 = R_i<0?-R_i:0 ++ UQADD8 r11,r10,r2 ; r11= 255-max(2*L-abs(R_i<0),0) ++ UQADD8 r14,r4, r2 ; r14= 255-max(2*L-abs(R_i>0),0) ++ UQADD8 r10,r10,r11 ++ UQADD8 r4, r4, r14 ++ UQSUB8 r10,r10,r11 ; r10= min(abs(R_i<0),max(2*L-abs(R_i<0),0)) ++ UQSUB8 r4, r4, r14 ; r4 = min(abs(R_i>0),max(2*L-abs(R_i>0),0)) ++ UQSUB8 r11,r5, r12 ; r11= R_i>0?R_i:0 ++ UQADD8 r6, r6, r10 ++ UQSUB8 r8, r8, r10 ++ UQSUB8 r5, r12,r5 ; r5 = R_i<0?-R_i:0 ++ UQSUB8 r6, r6, r4 ; r6 = p1+lflim(R_i,L) ++ UQADD8 r8, r8, r4 ; r8 = p2-lflim(R_i,L) ++ UQADD8 r10,r11,r2 ; r10= 255-max(2*L-abs(R_i<0),0) ++ UQADD8 r14,r5, r2 ; r14= 255-max(2*L-abs(R_i>0),0) ++ UQADD8 r11,r11,r10 ++ UQADD8 r5, r5, r14 ++ UQSUB8 r11,r11,r10 ; r11= min(abs(R_i<0),max(2*L-abs(R_i<0),0)) ++ UQSUB8 r5, r5, r14 ; r5 = min(abs(R_i>0),max(2*L-abs(R_i>0),0)) ++ UQADD8 r7, r7, r11 ++ UQSUB8 r9, r9, r11 ++ UQSUB8 r7, r7, r5 ; r7 = p5+lflim(R_i,L) ++ STRD r6, [r0, -r1] ; [p5:p1] = [r7: r6] ++ UQADD8 r9, r9, r5 ; r9 = p6-lflim(R_i,L) ++ STRD r8, [r0] ; [p6:p2] = [r9: r8] ++ LDMFD r13!,{r4-r11,PC} ++ ENDP ++ ++oc_loop_filter_frag_rows_v6 PROC ++ ; r0 = _ref_frame_data ++ ; r1 = _ystride ++ ; r2 = _bv ++ ; r3 = _frags ++ ; r4 = _fragi0 ++ ; r5 = 
_fragi0_end ++ ; r6 = _fragi_top ++ ; r7 = _fragi_bot ++ ; r8 = _frag_buf_offs ++ ; r9 = _nhfrags ++ MOV r12,r13 ++ STMFD r13!,{r0,r4-r11,r14} ++ LDMFD r12,{r4-r9} ++ LDR r2, [r2] ; ll = *(int *)_bv ++ CMP r4, r5 ; if(_fragi0>=_fragi0_end) ++ BGE oslffri_v6_end ; bail ++ SUBS r9, r9, #1 ; r9 = _nhfrags-1 if (r9<=0) ++ BLE oslffri_v6_end ; bail ++ ADD r3, r3, r4, LSL #2 ; r3 = &_frags[fragi] ++ ADD r8, r8, r4, LSL #2 ; r8 = &_frag_buf_offs[fragi] ++ SUB r7, r7, r9 ; _fragi_bot -= _nhfrags; ++oslffri_v6_lp1 ++ MOV r10,r4 ; r10= fragi = _fragi0 ++ ADD r11,r4, r9 ; r11= fragi_end-1=fragi+_nhfrags-1 ++oslffri_v6_lp2 ++ LDR r14,[r3], #4 ; r14= _frags[fragi] _frags++ ++ LDR r0, [r13] ; r0 = _ref_frame_data ++ LDR r12,[r8], #4 ; r12= _frag_buf_offs[fragi] _frag_buf_offs++ ++ TST r14,#OC_FRAG_CODED_FLAG ++ BEQ oslffri_v6_uncoded ++ CMP r10,r4 ; if (fragi>_fragi0) ++ ADD r0, r0, r12 ; r0 = _ref_frame_data + _frag_buf_offs[fragi] ++ BLGT loop_filter_h_v6 ++ CMP r4, r6 ; if (fragi0>_fragi_top) ++ BLGT loop_filter_v_v6 ++ CMP r10,r11 ; if(fragi+1>3 1,4 ++ ADD r12,r12,r1, LSL #2 ++ ; We want to do ++ ; f = CLAMP(MIN(-2L-f,0), f, MAX(2L-f,0)) ++ ; = ((f >= 0) ? MIN( f ,MAX(2L- f ,0)) : MAX( f , MIN(-2L- f ,0))) ++ ; = ((f >= 0) ? MIN(|f|,MAX(2L-|f|,0)) : MAX(-|f|, MIN(-2L+|f|,0))) ++ ; = ((f >= 0) ? MIN(|f|,MAX(2L-|f|,0)) :-MIN( |f|,-MIN(-2L+|f|,0))) ++ ; = ((f >= 0) ? MIN(|f|,MAX(2L-|f|,0)) :-MIN( |f|, MAX( 2L-|f|,0))) ++ ; So we've reduced the left and right hand terms to be the same, except ++ ; for a negation. ++ ; Stall x3 ++ VABS.S16 Q9, Q0 ; Q9 = |f| in U16s 1,4 ++ PLD [r12,-r1] ++ VSHR.S16 Q0, Q0, #15 ; Q0 = -1 or 0 according to sign 1,3 ++ PLD [r12] ++ VQSUB.U16 Q10,Q15,Q9 ; Q10= MAX(2L-|f|,0) in U16s 1,4 ++ PLD [r12,r1] ++ VMOVL.U8 Q1, D2 ; Q2 = __UU__QQ__MM__II__EE__AA__66__22 2,3 ++ PLD [r12,r1,LSL #1] ++ VMIN.U16 Q9, Q10,Q9 ; Q9 = MIN(|f|,MAX(2L-|f|)) 1,4 ++ ADD r12,r12,r1, LSL #2 ++ ; Now we need to correct for the sign of f. 
++ ; For negative elements of Q0, we want to subtract the appropriate ++ ; element of Q9. For positive elements we want to add them. No NEON ++ ; instruction exists to do this, so we need to negate the negative ++ ; elements, and we can then just add them. a-b = a-(1+!b) = a-1+!b ++ VADD.S16 Q9, Q9, Q0 ; 1,3 ++ PLD [r12,-r1] ++ VEOR.S16 Q9, Q9, Q0 ; Q9 = real value of f 1,3 ++ ; Bah. No VRSBW.U8 ++ ; Stall (just 1 as Q9 not needed to second pipeline stage. I think.) ++ VADDW.U8 Q2, Q9, D4 ; Q1 = xxTTxxPPxxLLxxHHxxDDxx99xx55xx11 1,3 ++ VSUB.S16 Q1, Q1, Q9 ; Q2 = xxUUxxQQxxMMxxIIxxEExxAAxx66xx22 1,3 ++ VQMOVUN.S16 D4, Q2 ; D4 = TTPPLLHHDD995511 1,1 ++ VQMOVUN.S16 D2, Q1 ; D2 = UUQQMMIIEEAA6622 1,1 ++ SUB r12,r0, #1 ++ VTRN.8 D4, D2 ; D4 = QQPPIIHHAA992211 D2 = MMLLEEDD6655 1,1 ++ VST1.16 {D4[0]}, [r12], r1 ++ VST1.16 {D2[0]}, [r12], r1 ++ VST1.16 {D4[1]}, [r12], r1 ++ VST1.16 {D2[1]}, [r12], r1 ++ VST1.16 {D4[2]}, [r12], r1 ++ VST1.16 {D2[2]}, [r12], r1 ++ VST1.16 {D4[3]}, [r12], r1 ++ VST1.16 {D2[3]}, [r12], r1 ++ MOV PC,r14 ++ ENDP ++ ++loop_filter_v_neon PROC ++ ; r0 = unsigned char *_pix ++ ; r1 = int _ystride ++ ; r2 = int *_bv ++ ; preserves r0-r3 ++ ; We assume Q15= 2*L in U16s ++ ; My best guesses at cycle counts (and latency)--vvv ++ SUB r12,r0, r1, LSL #1 ++ VLD1.64 {D0}, [r12@64], r1 ; D0 = SSOOKKGGCC884400 2,1 ++ VLD1.64 {D2}, [r12@64], r1 ; D2 = TTPPLLHHDD995511 2,1 ++ VLD1.64 {D4}, [r12@64], r1 ; D4 = UUQQMMIIEEAA6622 2,1 ++ VLD1.64 {D6}, [r12@64] ; D6 = VVRRNNJJFFBB7733 2,1 ++ VSUBL.U8 Q8, D4, D2 ; Q8 = 22 - 11 in S16s 1,3 ++ VSUBL.U8 Q0, D0, D6 ; Q0 = 00 - 33 in S16s 1,3 ++ ADD r12, #8 ++ VADD.S16 Q0, Q0, Q8 ; 1,3 ++ PLD [r12] ++ VADD.S16 Q0, Q0, Q8 ; 1,3 ++ PLD [r12,r1] ++ VADD.S16 Q0, Q0, Q8 ; Q0 = [0-3]+3*[2-1] 1,3 ++ SUB r12, r0, r1 ++ VRSHR.S16 Q0, Q0, #3 ; Q0 = f = ([0-3]+3*[2-1]+4)>>3 1,4 ++ ; We want to do ++ ; f = CLAMP(MIN(-2L-f,0), f, MAX(2L-f,0)) ++ ; = ((f >= 0) ? MIN( f ,MAX(2L- f ,0)) : MAX( f , MIN(-2L- f ,0))) ++ ; = ((f >= 0) ? 
MIN(|f|,MAX(2L-|f|,0)) : MAX(-|f|, MIN(-2L+|f|,0))) ++ ; = ((f >= 0) ? MIN(|f|,MAX(2L-|f|,0)) :-MIN( |f|,-MIN(-2L+|f|,0))) ++ ; = ((f >= 0) ? MIN(|f|,MAX(2L-|f|,0)) :-MIN( |f|, MAX( 2L-|f|,0))) ++ ; So we've reduced the left and right hand terms to be the same, except ++ ; for a negation. ++ ; Stall x3 ++ VABS.S16 Q9, Q0 ; Q9 = |f| in U16s 1,4 ++ VSHR.S16 Q0, Q0, #15 ; Q0 = -1 or 0 according to sign 1,3 ++ ; Stall x2 ++ VQSUB.U16 Q10,Q15,Q9 ; Q10= MAX(2L-|f|,0) in U16s 1,4 ++ VMOVL.U8 Q2, D4 ; Q2 = __UU__QQ__MM__II__EE__AA__66__22 2,3 ++ ; Stall x2 ++ VMIN.U16 Q9, Q10,Q9 ; Q9 = MIN(|f|,MAX(2L-|f|)) 1,4 ++ ; Now we need to correct for the sign of f. ++ ; For negative elements of Q0, we want to subtract the appropriate ++ ; element of Q9. For positive elements we want to add them. No NEON ++ ; instruction exists to do this, so we need to negate the negative ++ ; elements, and we can then just add them. a-b = a-(1+!b) = a-1+!b ++ ; Stall x3 ++ VADD.S16 Q9, Q9, Q0 ; 1,3 ++ ; Stall x2 ++ VEOR.S16 Q9, Q9, Q0 ; Q9 = real value of f 1,3 ++ ; Bah. No VRSBW.U8 ++ ; Stall (just 1 as Q9 not needed to second pipeline stage. I think.) 
++ VADDW.U8 Q1, Q9, D2 ; Q1 = xxTTxxPPxxLLxxHHxxDDxx99xx55xx11 1,3 ++ VSUB.S16 Q2, Q2, Q9 ; Q2 = xxUUxxQQxxMMxxIIxxEExxAAxx66xx22 1,3 ++ VQMOVUN.S16 D2, Q1 ; D2 = TTPPLLHHDD995511 1,1 ++ VQMOVUN.S16 D4, Q2 ; D4 = UUQQMMIIEEAA6622 1,1 ++ VST1.64 {D2}, [r12@64], r1 ++ VST1.64 {D4}, [r12@64], r1 ++ MOV PC,r14 ++ ENDP ++ ++oc_loop_filter_frag_rows_neon PROC ++ ; r0 = _ref_frame_data ++ ; r1 = _ystride ++ ; r2 = _bv ++ ; r3 = _frags ++ ; r4 = _fragi0 ++ ; r5 = _fragi0_end ++ ; r6 = _fragi_top ++ ; r7 = _fragi_bot ++ ; r8 = _frag_buf_offs ++ ; r9 = _nhfrags ++ MOV r12,r13 ++ STMFD r13!,{r0,r4-r11,r14} ++ LDMFD r12,{r4-r9} ++ CMP r4, r5 ; if(_fragi0>=_fragi0_end) ++ BGE oslffri_neon_end; bail ++ SUBS r9, r9, #1 ; r9 = _nhfrags-1 if (r9<=0) ++ BLE oslffri_neon_end ; bail ++ VLD1.64 {D30,D31}, [r2@128] ; Q15= 2L in U16s ++ ADD r3, r3, r4, LSL #2 ; r3 = &_frags[fragi] ++ ADD r8, r8, r4, LSL #2 ; r8 = &_frag_buf_offs[fragi] ++ SUB r7, r7, r9 ; _fragi_bot -= _nhfrags; ++oslffri_neon_lp1 ++ MOV r10,r4 ; r10= fragi = _fragi0 ++ ADD r11,r4, r9 ; r11= fragi_end-1=fragi+_nhfrags-1 ++oslffri_neon_lp2 ++ LDR r14,[r3], #4 ; r14= _frags[fragi] _frags++ ++ LDR r0, [r13] ; r0 = _ref_frame_data ++ LDR r12,[r8], #4 ; r12= _frag_buf_offs[fragi] _frag_buf_offs++ ++ TST r14,#OC_FRAG_CODED_FLAG ++ BEQ oslffri_neon_uncoded ++ CMP r10,r4 ; if (fragi>_fragi0) ++ ADD r0, r0, r12 ; r0 = _ref_frame_data + _frag_buf_offs[fragi] ++ BLGT loop_filter_h_neon ++ CMP r4, r6 ; if (_fragi0>_fragi_top) ++ BLGT loop_filter_v_neon ++ CMP r10,r11 ; if(fragi+1