swresample/resample: speed up build_filter by 50%

This speeds up build_filter by ~ 50%. This gain should be pretty consistent across all architectures and platforms. Essentially, this relies on a observation that the filters have some even/odd symmetry that may be exploited during the construction of the polyphase filter bank. In particular, phases (scaled to [0, 1]) in [0.5, 1] are easily derived from [0, 0.5] and expensive reevaluation of function points are unnecessary. This requires some rather annoying even/odd bookkeeping as can be seen from the patch. I vaguely recall from signal processing theory more general symmetries allowing even greater optimization of the construction. At a high level, "even functions" correspond to 2, and one can imagine variations. Nevertheless, for the sake of some generality and because of existing filters, this is all that is being exploited. Currently, this patch relies on phase_count being even or (trivially) 1, though this is not an inherent limitation to the approach. This assumption is safe as phase_count is 1 << phase_bits, and is hence a power of two. There is no way for user API to set it to a nontrivial odd number. This assumption has been placed as an assert in the code. To repeat, this assumes even symmetry of the filters, which is the most common way to get generalized linear phase anyway and is true of all currently supported filters. As a side note, accuracy should be identical or perhaps slightly better due to this "forcing" filter symmetries leading to a better phase characteristic. As before, I can't test this claim easily, though it may be of interest. Patch tested with FATE. Sample benchmark (x86-64, Haswell, GNU/Linux): test: swr-resample-dblp-44100-2626 new: 527376779 decicycles in build_filter(loop 1000), 256 runs, 0 skips 524361765 decicycles in build_filter(loop 1000), 512 runs, 0 skips 516552574 decicycles in build_filter(loop 1000), 1024 runs, 0 skips old: 974178658 decicycles in build_filter(loop 1000), 256 runs, 0 skips 972794408 decicycles in build_filter(loop 1000), 512 runs, 0 skips 954350046 decicycles in build_filter(loop 1000), 1024 runs, 0 skips Note that lower level optimizations are entirely possible, I focussed on getting the high level semantics correct. In any case, this should provide a good foundation. Reviewed-by: Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>

swresample/resample: speed up build_filter by 50%
This speeds up build_filter by ~ 50%. This gain should be pretty consistent across all architectures and platforms. Essentially, this relies on a observation that the filters have some even/odd symmetry that may be exploited during the construction of the polyphase filter bank. In particular, phases (scaled to [0, 1]) in [0.5, 1] are easily derived from [0, 0.5] and expensive reevaluation of function points are unnecessary. This requires some rather annoying even/odd bookkeeping as can be seen from the patch. I vaguely recall from signal processing theory more general symmetries allowing even greater optimization of the construction. At a high level, "even functions" correspond to 2, and one can imagine variations. Nevertheless, for the sake of some generality and because of existing filters, this is all that is being exploited. Currently, this patch relies on phase_count being even or (trivially) 1, though this is not an inherent limitation to the approach. This assumption is safe as phase_count is 1 << phase_bits, and is hence a power of two. There is no way for user API to set it to a nontrivial odd number. This assumption has been placed as an assert in the code. To repeat, this assumes even symmetry of the filters, which is the most common way to get generalized linear phase anyway and is true of all currently supported filters. As a side note, accuracy should be identical or perhaps slightly better due to this "forcing" filter symmetries leading to a better phase characteristic. As before, I can't test this claim easily, though it may be of interest. Patch tested with FATE. Sample benchmark (x86-64, Haswell, GNU/Linux): test: swr-resample-dblp-44100-2626 new: 527376779 decicycles in build_filter(loop 1000), 256 runs, 0 skips 524361765 decicycles in build_filter(loop 1000), 512 runs, 0 skips 516552574 decicycles in build_filter(loop 1000), 1024 runs, 0 skips old: 974178658 decicycles in build_filter(loop 1000), 256 runs, 0 skips 972794408 decicycles in build_filter(loop 1000), 512 runs, 0 skips 954350046 decicycles in build_filter(loop 1000), 1024 runs, 0 skips Note that lower level optimizations are entirely possible, I focussed on getting the high level semantics correct. In any case, this should provide a good foundation. Reviewed-by: Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
9bec6d71 · Ganesh Ajjanagadde · cc35f6f4 · 9bec6d71
Commit 9bec6d71 authored Nov 04, 2015 by Ganesh Ajjanagadde
Hide whitespace changes
Inline Side-by-side

Showing with 40 additions and 4 deletions

resample.c libswresample/resample.c +40 -4

No files found.
--- a/libswresample/resample.c
+++ b/libswresample/resample.c
@@ -74,7 +74,7 @@ static int build_filter(ResampleContext *c, void *filter, double factor, int tap
                        int filter_type, int kaiser_beta){
    int ph, i;
    double x, y, w;
-    double *tab = av_malloc_array(tap_count,  sizeof(*tab));
+    double *tab = av_malloc_array(tap_count+1,  sizeof(*tab));
    const int center= (tap_count-1)/2;

    if (!tab)
@@ -84,9 +84,10 @@ static int build_filter(ResampleContext *c, void *filter, double factor, int tap
    if (factor > 1.0)
        factor = 1.0;

-    for(ph=0;ph<phase_count;ph++) {
+    av_assert0(phase_count == 1 || phase_count % 2 == 0);
+    for(ph = 0; ph <= phase_count / 2; ph++) {
        double norm = 0;
-        for(i=0;i<tap_count;i++) {
+        for(i=0;i<=tap_count;i++) {
            x = M_PI * ((double)(i - center) - (double)ph / phase_count) * factor;
            if (x == 0) y = 1.0;
            else        y = sin(x) / x;
@@ -110,7 +111,8 @@ static int build_filter(ResampleContext *c, void *filter, double factor, int tap
            }

            tab[i] = y;
-            norm += y;
+            if (i < tap_count)
+                norm += y;
        }

        /* normalize so that an uniform color remains the same */
@@ -118,18 +120,52 @@ static int build_filter(ResampleContext *c, void *filter, double factor, int tap
        case AV_SAMPLE_FMT_S16P:
            for(i=0;i<tap_count;i++)
                ((int16_t*)filter)[ph * alloc + i] = av_clip(lrintf(tab[i] * scale / norm), INT16_MIN, INT16_MAX);
+            if (tap_count % 2 == 0) {
+                for (i = 0; i < tap_count; i++)
+                    ((int16_t*)filter)[(phase_count-ph) * alloc + tap_count-1-i] = ((int16_t*)filter)[ph * alloc + i];
+            }
+            else {
+                for (i = 1; i <= tap_count; i++)
+                    ((int16_t*)filter)[(phase_count-ph) * alloc + tap_count-i] =
+                        av_clip(lrintf(tab[i] * scale / (norm - tab[0] + tab[tap_count])), INT16_MIN, INT16_MAX);
+            }
            break;
        case AV_SAMPLE_FMT_S32P:
            for(i=0;i<tap_count;i++)
                ((int32_t*)filter)[ph * alloc + i] = av_clipl_int32(llrint(tab[i] * scale / norm));
+            if (tap_count % 2 == 0) {
+                for (i = 0; i < tap_count; i++)
+                    ((int32_t*)filter)[(phase_count-ph) * alloc + tap_count-1-i] = ((int32_t*)filter)[ph * alloc + i];
+            }
+            else {
+                for (i = 1; i <= tap_count; i++)
+                    ((int32_t*)filter)[(phase_count-ph) * alloc + tap_count-i] =
+                        av_clipl_int32(llrint(tab[i] * scale / (norm - tab[0] + tab[tap_count])));
+            }
            break;
        case AV_SAMPLE_FMT_FLTP:
            for(i=0;i<tap_count;i++)
                ((float*)filter)[ph * alloc + i] = tab[i] * scale / norm;
+            if (tap_count % 2 == 0) {
+                for (i = 0; i < tap_count; i++)
+                    ((float*)filter)[(phase_count-ph) * alloc + tap_count-1-i] = ((float*)filter)[ph * alloc + i];
+            }
+            else {
+                for (i = 1; i <= tap_count; i++)
+                    ((float*)filter)[(phase_count-ph) * alloc + tap_count-i] = tab[i] * scale / (norm - tab[0] + tab[tap_count]);
+            }
            break;
        case AV_SAMPLE_FMT_DBLP:
            for(i=0;i<tap_count;i++)
                ((double*)filter)[ph * alloc + i] = tab[i] * scale / norm;
+            if (tap_count % 2 == 0) {
+                for (i = 0; i < tap_count; i++)
+                    ((double*)filter)[(phase_count-ph) * alloc + tap_count-1-i] = ((double*)filter)[ph * alloc + i];
+            }
+            else {
+                for (i = 1; i <= tap_count; i++)
+                    ((double*)filter)[(phase_count-ph) * alloc + tap_count-i] = tab[i] * scale / (norm - tab[0] + tab[tap_count]);
+            }
            break;
        }
    }