All of this is on an in-order Cortex-A53 core. More modern cores fare better, and ffmpeg's MDCT was optimized for the A72, so I suspect these figures will improve on newer hardware, but I currently only have access to an old Odroid C2.

For libopus, the MDCT accounts for 26% of overall decoding time; for ffmpeg's Opus decoder, it is 30%. libopus's MDCT is currently 36.5% slower than ffmpeg's MDCT on aarch64, and since ffmpeg's MDCT is not yet fully optimized on aarch64, that gap will only grow. Judging by how much ffmpeg's x86 MDCT loses when its optimizations are degraded to the current state of the aarch64 code, a fully optimized aarch64 MDCT should be about 2x faster than it is today. Hence the final potential speedup of the MDCT would be around 73%, and with the MDCT at 26% of decoding, I expect a total speedup/power reduction of about 19% (26% x 0.73).

All of these tests assume the worst case for my argument, i.e. the fastest existing baseline: libopus's current native transforms. However, if libNE10 is available at build time, libopus on aarch64 will, _by default_, be built against it to handle the MDCT, the logic being that the library is better optimized than libopus's own code. As a result, a lot of projects that statically link libopus build it on aarch64 with libNE10 enabled. In my tests, however, libNE10 is in fact 5% slower than libopus's native MDCT. If your libopus builds currently use libNE10, the speedup from switching to ffmpeg's MDCT would likely be around 26%.

I also noticed that neither the deemphasis filter nor the postfilter is currently SIMD'd in libopus. Combined, they are around 5% of current decoding time; cutting that down via SIMD, I expect another 6% of speedup from those two functions.

To compare the MDCTs directly, I ripped all of the MDCT code out of libopus and added it to my testing program - https://github.com/cyanreg/lavu_fft_test - and you're welcome to try to replicate these results by running it on the cores you're interested in. To measure what fraction of decoding the MDCT accounts for, I simply commented out all calls to clt_mdct_backward in libopus and measured the decoder's speed before and after (some rough sketches of these measurements are appended at the end of this post).

Furthermore, I suspect I can squeeze out a few more percent by optimizing haar() in decode_bands(). Optimistically, I think I can cut the total power consumption by one third, assuming performance correlates with efficiency. And since we would be using a lot more SIMD instructions, it's likely that the CPU would be able to pipeline better, further reducing power.
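
To give an idea of what the ffmpeg side of the comparison looks like, below is a minimal, self-contained sketch that times ffmpeg's inverse MDCT through the public av_tx API. This is not lavu_fft_test itself; the transform length (960, a 20 ms frame at 48 kHz) and the iteration count are arbitrary choices for illustration.

/* Build roughly with: gcc imdct_bench.c -o imdct_bench $(pkg-config --cflags --libs libavutil) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <libavutil/tx.h>
#include <libavutil/mem.h>

#define LEN  960       /* frame size; arbitrary illustrative choice */
#define RUNS 200000    /* arbitrary iteration count */

int main(void)
{
    AVTXContext *ctx = NULL;
    av_tx_fn tx = NULL;
    float scale = 1.0f;

    /* Single-precision inverse MDCT of frame size LEN. */
    if (av_tx_init(&ctx, &tx, AV_TX_FLOAT_MDCT, 1 /* inverse */, LEN, &scale, 0) < 0) {
        fprintf(stderr, "av_tx_init failed\n");
        return 1;
    }

    /* Over-allocate the output; the default inverse transform is half-length. */
    float *in  = av_malloc(LEN     * sizeof(*in));
    float *out = av_malloc(2 * LEN * sizeof(*out));
    for (int i = 0; i < LEN; i++)
        in[i] = (float)rand() / RAND_MAX - 0.5f;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < RUNS; i++)
        tx(ctx, out, in, sizeof(float)); /* stride is in bytes */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d inverse MDCTs of length %d: %.3f s (%.2f us each)\n",
           RUNS, LEN, sec, 1e6 * sec / RUNS);

    av_freep(&in);
    av_freep(&out);
    av_tx_uninit(&ctx);
    return 0;
}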
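
For the decode-share measurement, a self-contained timing loop over the public libopus API is enough. The sketch below encodes some noise and then times repeated decoding; build it once against an unmodified libopus and once against a copy with the clt_mdct_backward() calls commented out, then compare the timings. The frame count, bitrate and pass count are arbitrary.

/* Build roughly with: gcc decode_bench.c -o decode_bench $(pkg-config --cflags --libs opus) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <opus.h>

#define RATE    48000
#define CH      2
#define FRAME   960        /* 20 ms at 48 kHz */
#define NFRAMES 500        /* arbitrary */
#define PASSES  200        /* arbitrary */

static float         pcm[FRAME * CH];
static unsigned char packets[NFRAMES][1500];
static int           sizes[NFRAMES];
static float         out[FRAME * CH];

int main(void)
{
    int err;
    OpusEncoder *enc = opus_encoder_create(RATE, CH, OPUS_APPLICATION_AUDIO, &err);
    OpusDecoder *dec = opus_decoder_create(RATE, CH, &err);
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(128000)); /* high enough to stay in CELT fullband */

    /* Encode some noise once; the content doesn't matter, only decoding is timed. */
    for (int f = 0; f < NFRAMES; f++) {
        for (int i = 0; i < FRAME * CH; i++)
            pcm[i] = (float)rand() / RAND_MAX - 0.5f;
        sizes[f] = opus_encode_float(enc, pcm, FRAME, packets[f], sizeof(packets[f]));
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < PASSES; p++)
        for (int f = 0; f < NFRAMES; f++)
            opus_decode_float(dec, packets[f], sizes[f], out, FRAME, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("decoded %d frames in %.3f s\n", PASSES * NFRAMES, sec);

    opus_encoder_destroy(enc);
    opus_decoder_destroy(dec);
    return 0;
}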
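
And as for why the deemphasis filter hasn't been SIMD'd: it is a first-order IIR, so each output sample depends on the previous one and autovectorizers give up on it. The sketch below is hypothetical, not libopus's actual code, but it has the same recurrence shape (y[n] = x[n] + c*y[n-1], with c around 0.85); vectorizing it means unrolling and expanding the recurrence (precomputing c, c^2, c^3, ...).

#include <stdio.h>

/* Hypothetical scalar sketch of a first-order deemphasis filter. */
static void deemphasis_sketch(const float *x, float *y, int n, float c, float *mem)
{
    float m = *mem;
    for (int i = 0; i < n; i++) {
        m = x[i] + c * m;   /* each sample depends on the previous output */
        y[i] = m;
    }
    *mem = m;
}

int main(void)
{
    float in[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
    float out[8], mem = 0.0f;
    deemphasis_sketch(in, out, 8, 0.85f, &mem);
    for (int i = 0; i < 8; i++)
        printf("%f\n", out[i]);  /* impulse response: 1, 0.85, 0.85^2, ... */
    return 0;
}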