
x264 Encoder is an open source H.264/AVC video encoder. The x264 CLI is the command-line encoder tool and is used by several converters, such as Handbrake, Xvid4PSP, StaxRip, RipBot264, FairUse Wizard and MEGUI.

Free software
OS:Windows MacOS Linux
File size:8.3MB
Portable version
Version history available

16 votes



  Latest version

r2705 (June 26, 2016)

Download sites

Visit developer's site

Download x264 Encoder r2705 (portable)

Download Mac, Linux, mirror and other versions

Download x264 Encoder r2705 from another mirror site

Download x264 Encoder Mac version

Download x264 Encoder Linux version

Supported operating systems

Windows Mac OS Linux

More information and other downloads

Download Komisar's unofficial x264 VFW codec here, or another unofficial x264 VFW codec here, to use x264 in, for example, VirtualDub or other software that supports Video for Windows (VFW) codecs. Both encoding and decoding are supported.

x264 Encoder GUIs/Frontends:
Handbrake, Xvid4PSP, StaxRip, RipBot264, FairUse Wizard, MEGUI, AutoX264, HDConvertToX.

Version history / Release notes / Changelog

commit 3f5ed56d4105f68c01b86f94f41bb9bbefa3433b [revision 2705]
Author: Henrik Gramner
Date: Sun Apr 3 17:28:33 2016 +0200

configure: Support specifying a custom pkg-config

commit 7c9c687d8062f72b3ec300de8997bdae8277a741 [revision 2704]
Author: Anton Mitrofanov
Date: Wed Jun 8 22:46:17 2016 +0300

Add support for new VUI parameters

Support the new color primaries, transfer characteristics, and matrix
coefficients defined in the 2016-02 edition of the H.264 specification.

commit 92515e8ff73491ef8a44c85e0bee265ba5791070 [revision 2703]
Author: Henrik Gramner
Date: Sun Apr 24 14:10:22 2016 +0200

configure: Add link-time optimization support

Enabled by using the --enable-lto configuration option.

May give a slight performance improvement in some cases, but it can
also reduce performance in other cases (largely compiler-dependent)
so don't enable it by default. It also makes compilation (and linking
in particular) a fair bit slower.

Note that some older versions of GNU binutils will incorrectly warn
about "memset used with constant zero length parameter" when linking
using LTO. This is due to a bug in binutils and can safely be ignored.

commit b6267e0ff770545de88dfb5d3f176ea73f453730 [revision 2702]
Author: Henrik Gramner
Date: Sun Apr 24 13:32:43 2016 +0200

configure: Fix clang detection with versioned binaries

Correctly detect clang binaries that have the version number appended
as a suffix to the file name, e.g. `clang38`.

commit 14a58532fea2c5f9e7b93c918476d842091c4268 [revision 2701]
Author: Janne Grunau
Date: Sun Apr 24 14:38:56 2016 +0200

arm: Add asm for mbtree fixed point conversion

7-8 times faster on a cortex-a53 vs. gcc-5.3.

mbtree_fix8_pack_c: 44114
mbtree_fix8_pack_neon: 5805
mbtree_fix8_unpack_c: 38924
mbtree_fix8_unpack_neon: 4870
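
The "7-8 times faster" figure follows from the cycle counts listed above. A minimal sketch of that arithmetic, with illustrative names (`timings` and `speedup` are not part of x264 or checkasm):

```python
# Cycle counts copied from the checkasm output above (arm, cortex-a53).
timings = {
    "mbtree_fix8_pack": {"c": 44114, "neon": 5805},
    "mbtree_fix8_unpack": {"c": 38924, "neon": 4870},
}


def speedup(c_cycles, asm_cycles):
    """Return how many times faster the asm version is than the C version."""
    return c_cycles / asm_cycles


for name, t in timings.items():
    print(f"{name}: {speedup(t['c'], t['neon']):.1f}x")
```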

commit b6f189eb4c5646483f7901293944695167e71ed9 [revision 2700]
Author: Janne Grunau
Date: Sun Apr 24 14:38:55 2016 +0200

aarch64: Add asm for mbtree fixed point conversion

pack is ~7 times faster and unpack is ~9 times faster on a cortex-a53
compared to gcc-5.3.

mbtree_fix8_pack_c: 41534
mbtree_fix8_pack_neon: 5766
mbtree_fix8_unpack_c: 44102
mbtree_fix8_unpack_neon: 4868

commit a5e06b9a435852f0125de4ecb198ad47340483fa [revision 2699]
Author: Anton Mitrofanov
Date: Sun May 22 22:33:58 2016 +0300

Fix p4x4 analyse for 4:4:4 encoding with chroma ME

commit 07221290db0a94bda1f6ece3fdf3c02675c8adce [revision 2698]
Author: Anton Mitrofanov
Date: Sun May 22 22:18:34 2016 +0300

Fix 4:4:4 encoding with CQM

commit 23ebc1f763936b7fcfc81e21530e1b65dbc503b9 [revision 2697]
Author: Anton Mitrofanov
Date: Sun May 22 19:36:05 2016 +0300

Fix p4x4 RDO with CAVLC

commit 740a8c556bd9b68e899d6991f3f987a443aa14aa [revision 2696]
Author: Anton Mitrofanov
Date: Sat Apr 23 23:10:03 2016 +0300

Apply zone options a little bit earlier

This way things like SAR changes will have full effect from the start frame.

commit 928bd9d5def4f0ca5071ea176a11b816a01e6495 [revision 2695]
Author: Anton Mitrofanov
Date: Sat Apr 23 22:45:44 2016 +0300

Fix corruption when using encoder_reconfig() with some parameters

Changing parameters that affect SPS, like --ref for example, wasn't
behaving correctly previously.

Probably a regression in r2373.

Anton Mitrofanov [Wed, 13 Apr 2016 18:54:25 +0000 (21:54 +0300)]
Clean up header includes

Henrik Gramner [Wed, 13 Apr 2016 15:53:49 +0000 (17:53 +0200)]
Eliminate some compiler warnings on BSD

Include <strings.h> in addition to <string.h>. According to the POSIX
specification the prototypes for strcasecmp() and strncasecmp() are
declared in <strings.h>. On some systems they are also declared in
<string.h> for compatibility reasons but we shouldn't rely on that.

Define _POSIX_C_SOURCE only when it's required to do so. Some BSD
variants don't declare certain function prototypes otherwise.

Henrik Gramner [Tue, 12 Apr 2016 19:33:54 +0000 (21:33 +0200)]
osx: Add -D_DARWIN_C_SOURCE to CFLAGS

OSX doesn't like _POSIX_C_SOURCE being defined when _DARWIN_C_SOURCE isn't.

Anton Mitrofanov [Tue, 12 Apr 2016 17:33:42 +0000 (20:33 +0300)]
Remove an unused parameter from x264_slicetype_frame_cost()

The b_intra_penalty parameter is no longer used anywhere after the
improvements to the --b-adapt 1 algorithm.

Anton Mitrofanov [Sun, 10 Apr 2016 17:17:32 +0000 (20:17 +0300)]
Improve the --b-adapt 1 algorithm

Roughly the same speed as before but with significantly better results,
comparable to --b-adapt 2.

Henrik Gramner [Sun, 3 Apr 2016 13:49:26 +0000 (15:49 +0200)]
analyse: i_sub_partition write combining

Henrik Gramner [Tue, 15 Mar 2016 19:16:45 +0000 (20:16 +0100)]
x86: Use one less register in mbtree_propagate_cost_avx2

Avoids the need to save and restore xmm6 on 64-bit Windows.

Henrik Gramner [Fri, 4 Mar 2016 16:53:08 +0000 (17:53 +0100)]
x86: Add asm for mbtree fixed point conversion

The QP offsets of each macroblock are stored as floats internally and
converted to big-endian Q8.8 fixed point numbers when written to the 2-pass
stats file, and converted back to floats when read from the stats file.

Add SSSE3 and AVX2 implementations for conversions in both directions.

About 8x faster than C on Haswell.
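
The conversion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the x264 implementation: it assumes a signed 16-bit Q8.8 value, truncation toward zero, and inputs that fit the 16-bit range. The names `fix8_pack` and `fix8_unpack` are borrowed from the benchmark labels elsewhere in this changelog.

```python
import struct


def fix8_pack(values):
    """Convert floats to big-endian signed Q8.8 (16-bit) fixed point bytes."""
    # int() truncates toward zero, like a C float-to-integer cast.
    # Assumption: every scaled value fits in the signed 16-bit range.
    return b"".join(struct.pack(">h", int(v * 256.0)) for v in values)


def fix8_unpack(data):
    """Convert big-endian signed Q8.8 bytes back to floats."""
    count = len(data) // 2
    raw = struct.unpack(f">{count}h", data[: count * 2])
    return [q / 256.0 for q in raw]
```

Values that are multiples of 1/256 survive the round trip exactly; anything finer loses the extra precision when packed.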

Anton Mitrofanov [Thu, 7 Apr 2016 10:09:03 +0000 (13:09 +0300)]
x86inc: Enable AVX emulation in additional cases

Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.
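
A rough sketch of the operand handling this describes: lowering a three-operand AVX-style instruction to the two-operand SSE form, where the destination doubles as the first source. This is a simplified model rather than the x86inc macro itself. The `COMMUTATIVE` set and the function name are hypothetical, and `mova` stands in for whichever register move is appropriate.

```python
# Hypothetical sample of commutative instructions, for illustration only.
COMMUTATIVE = {"addps", "mulps", "paddw", "pand"}


def emulate_three_operand(op, dst, src1, src2):
    """Lower `op dst, src1, src2` to two-operand form (dst = dst op src)."""
    if dst == src1:
        # Already in two-operand shape.
        return [f"{op} {dst}, {src2}"]
    if dst == src2:
        if op in COMMUTATIVE:
            # a op b == b op a, so the sources can simply be swapped.
            return [f"{op} {dst}, {src1}"]
        raise ValueError(f"{op}: dst == src2 requires a commutative instruction")
    # General case: copy src1 into dst first, then apply the operation.
    return [f"mova {dst}, {src1}", f"{op} {dst}, {src2}"]
```

With this model, `addps m0, m1, m0` becomes the single instruction `addps m0, m1`, while a non-commutative instruction such as `subps` is rejected in the same position.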

Anton Mitrofanov [Thu, 7 Apr 2016 09:48:29 +0000 (12:48 +0300)]
x86inc: Improve handling of %ifid with multi-token parameters

The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.

Anton Mitrofanov [Mon, 28 Mar 2016 15:35:38 +0000 (18:35 +0300)]
x86inc: Fix AVX emulation of some instructions

Henrik Gramner [Fri, 4 Mar 2016 16:51:41 +0000 (17:51 +0100)]
x86inc: Fix AVX emulation of scalar float instructions

Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.

Henrik Gramner [Sat, 27 Feb 2016 19:34:39 +0000 (20:34 +0100)]
x86: dct2x4dc asm

Only used in 4:2:2. MMX2 version implemented for 8-bit, SSE2 and AVX
versions implemented for high bit-depth.

2.5x faster on 32-bit and 1.6x faster on 64-bit compared to C on Ivy Bridge.

Henrik Gramner [Sat, 20 Feb 2016 19:31:22 +0000 (20:31 +0100)]
x86: SSE2/AVX idct_dequant_2x4_(dc|dconly)

Only used in 4:2:2. Both 8-bit and high bit-depth implemented.

Approximate performance improvement compared to C on Ivy Bridge:

                         x86-32  x86-64
idct_dequant_2x4_dc      2.1x    1.7x
idct_dequant_2x4_dconly  2.7x    2.0x

Helps more on 32-bit due to the C versions being register starved.

Henrik Gramner [Sat, 20 Feb 2016 15:53:35 +0000 (16:53 +0100)]
checkasm: Fix idct_dequant_2x4_(dc|dconly) tests

They used the wrong qp values and the dconly test had the wrong name. This
was undetected before because there weren't any assembly implementations.

Henrik Gramner [Sun, 7 Feb 2016 13:55:26 +0000 (14:55 +0100)]
checkasm: Disable Windows Error Reporting

When developing new assembly code it's expected that checkasm may crash,
and the error reporting dialog popup can be somewhat annoying.

Henrik Gramner [Sat, 6 Feb 2016 17:49:46 +0000 (18:49 +0100)]
windows: Flag debug builds in the resource file

Henrik Gramner [Thu, 4 Feb 2016 19:06:57 +0000 (20:06 +0100)]
cli: Refactor filter option parsing

The old code contained a whole bunch of memory leaks, unchecked mallocs,
sections of dead code, etc. and was generally overly complex.

Also consolidate some memory allocations into a single one.

Henrik Gramner [Sun, 31 Jan 2016 20:50:52 +0000 (21:50 +0100)]
ffms: Various improvements

* Drop the MinGW Unicode workarounds. Those were required at the time
Windows Unicode support was added to x264 but the underlying problem
has since been fixed in FFMS.

* Use FFMS_IndexBelongsToFile() as an additional sanity check when reading
an index file to ensure that it belongs to the current source video.

* Upgrade to the new API to prevent deprecation warnings when compiling.

* Fix a resource leak that would occur if FFMS_GetFirstTrackOfType() or
FFMS_CreateVideoSource() failed.

* Minor string handling adjustments related to progress reporting.

This increases the FFMS version requirement from 2.16.2 to 2.21.0.

Henrik Gramner [Mon, 11 Apr 2016 14:59:46 +0000 (16:59 +0200)]
msvc: Add snprintf/vsnprintf replacements

MSVC pre-VS2015 has broken snprintf/vsnprintf implementations which are
incompatible with C99 and may lead to buffer overflows.

Henrik Gramner [Sun, 31 Jan 2016 19:21:01 +0000 (20:21 +0100)]
configure: Define feature test macros for --std=gnu99

Makes the printf() family functions on MinGW use the correct C99 POSIX
versions instead of the broken pre-VS2015 Microsoft ones.

Also allows us to get rid of some _GNU_SOURCE and _ISOC99_SOURCE defines.

Henrik Gramner [Thu, 28 Jan 2016 17:37:37 +0000 (18:37 +0100)]
mingw: Enable high-entropy ASLR on 64-bit Windows

To fully utilize HEASLR the image base address must also be set above
4 GiB. For consistency use the same address as MSVC uses by default.

This requires binutils 2.25 which isn't available on all common
distributions, so only enable it after checking that it's supported.

Henrik Gramner [Sun, 24 Jan 2016 00:48:18 +0000 (01:48 +0100)]
msvs: WinRT support

To compile x264 for WinRT the following additional steps have to be performed.

* Ensure that the necessary SDK is installed.

* Set the correct environment variables in the VS command prompt as shown at

* Add one of the following to --extra-cflags depending on the target OS:

Henrik Gramner [Sun, 24 Jan 2016 22:58:40 +0000 (23:58 +0100)]
configure: Disable CLI libraries when CLI is disabled

Henrik Gramner [Fri, 5 Feb 2016 17:46:13 +0000 (18:46 +0100)]
matroska: mk_close: Check fseek() return value

Henrik Gramner [Fri, 5 Feb 2016 17:46:02 +0000 (18:46 +0100)]
parse_qpfile: Check ftell() and fseek() return values

Anton Mitrofanov [Sun, 10 Apr 2016 17:13:59 +0000 (20:13 +0300)]
Use the correct default B-ref placement with B-pyramid

Cost analyse functions expect the placement of the B-ref in a sequence of
an even number of B-frames to be located towards the beginning while the
actual placement was towards the end.

Change the placement to be consistent with the analyse expectations, e.g.
PbbBbP -> PbBbbP.
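
The placement change can be illustrated with a small sketch. The index formulas below are an assumption chosen to reproduce the example in the commit message; they are not taken from the x264 source, and the function name is hypothetical.

```python
def b_pyramid_pattern(num_b_frames, prefer_early=True):
    """Return a frame-type string with one referenced B-frame ('B').

    With an even number of B-frames there are two middle positions:
    prefer_early=True picks the one nearer the start (new behaviour),
    prefer_early=False the one nearer the end (old behaviour).
    """
    frames = ["b"] * num_b_frames
    if num_b_frames >= 2:
        # Assumed index rule; both variants agree for odd counts.
        index = (num_b_frames - 1) // 2 if prefer_early else num_b_frames // 2
        frames[index] = "B"
    return "P" + "".join(frames) + "P"
```

For four B-frames, `b_pyramid_pattern(4, prefer_early=False)` gives the old `PbbBbP` and `b_pyramid_pattern(4)` gives the new `PbBbbP`; for an odd count both variants coincide.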

Henrik Gramner [Fri, 5 Feb 2016 17:45:47 +0000 (18:45 +0100)]
parse_zones: Fix memory leak

Alexey Samsonov [Tue, 26 Jan 2016 00:05:25 +0000 (16:05 -0800)]
Fix float-cast-overflow in x264_ratecontrol_end function

According to the C standard, it is undefined behavior to cast a negative
floating point number to an unsigned integer. Float-cast-overflow in
general is known to produce different results on different architectures.

Building x264 code with Clang and -fsanitize=float-cast-overflow
and running it on some real-life examples occasionally produces errors
of the form:

encoder/ratecontrol.c:1892: runtime error: value -5011.14 is outside the
range of representable values of type 'unsigned short'

Fix these errors by explicitly coding the de-facto x86 behavior: casting
float to uint16_t through int16_t.
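
To make the "through int16_t" step concrete, here is a sketch that emulates the two-stage cast with plain integers. It only models the arithmetic (Python has no undefined behaviour of this kind), and the function name is hypothetical.

```python
def to_uint16_via_int16(value):
    """Emulate (uint16_t)(int16_t)value for a float that fits in int16_t."""
    truncated = int(value)  # a C float-to-integer cast truncates toward zero
    if not -32768 <= truncated <= 32767:
        # Out-of-range floats would still be a problem for the int16_t step.
        raise OverflowError("value does not fit in int16_t")
    # Signed-to-unsigned conversion is well defined: reduce modulo 2**16.
    return truncated & 0xFFFF
```

The value from the sanitizer report, -5011.14, truncates to -5011 and then wraps to 65536 - 5011 = 60525.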

Sebastian Dröge [Sun, 20 Dec 2015 21:49:35 +0100 (23:49 +0300)]
Fix AVC-Intra padding for non-Annex B encoding

Anton Mitrofanov [Mon, 11 Jan 2016 19:39:22 +0100 (21:39 +0300)]
ppc: Only perform AltiVec detection if compiled with AltiVec enabled

Anton Mitrofanov [Tue, 13 Oct 2015 13:30:16 +0100 (15:30 +0300)]
2-pass: Take into account possible frame reordering

Anton Mitrofanov [Tue, 13 Oct 2015 10:54:05 +0100 (12:54 +0300)]
Revise the 2-pass algorithm

Anton Mitrofanov [Tue, 5 Jan 2016 00:41:43 +0100 (02:41 +0300)]
Revise the row VBV algorithm (part 2)

Should fix rare cases of VBV emergency mode activation caused by too much trust
in the row predictors.

Henrik Gramner [Fri, 1 Jan 2016 12:44:31 +0100 (12:44 +0100)]
Bump dates to 2016

Henrik Gramner [Mon, 26 Oct 2015 19:54:20 +0100 (19:54 +0100)]
cli: Use memory-mapped input frames for yuv and y4m

Improves performance by avoiding extraneous memory copying.
Most beneficial on fast settings.

On average around 5-10% faster overall on ultrafast but the
performance improvement can be even larger in some cases.

Henrik Gramner [Thu, 7 Jan 2016 01:59:24 +0100 (01:59 +0100)]
y4m: Support extended frame headers when seeking

Use the actual length of the frame header of the first frame instead of
assuming a header without extensions when calculating the frame size.

Also makes the frame counter more accurate with extended frame headers.
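
A sketch of the frame-size calculation this implies. It assumes 8-bit 4:2:0 video with even dimensions and that every frame header has the same length as the first one. The function names and the extended header used in the usage note are illustrative, not taken from x264.

```python
def y4m_frame_size(first_frame_header, width, height):
    """Bytes per frame: actual header length plus the 4:2:0 8-bit payload."""
    payload = width * height * 3 // 2  # Y plane + quarter-size U and V planes
    return len(first_frame_header) + payload


def y4m_frame_count(file_size, stream_header_len, frame_size):
    """Estimate the number of frames from the file size."""
    return (file_size - stream_header_len) // frame_size
```

With a plain `b"FRAME\n"` header a 4x2 frame occupies 18 bytes, while a header carrying an extra parameter, such as the illustrative `b"FRAME Ip\n"`, makes it 21 bytes. Assuming the shorter header would skew both seeking and the frame count.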

Henrik Gramner [Tue, 3 Nov 2015 17:55:08 +0100 (17:55 +0100)]
configure: Simplify cygwin/mingw/msys code

Avoids some code duplication.

Also drop the -mno-cygwin check since that option was removed back in 2008.

Henrik Gramner [Mon, 26 Oct 2015 18:52:46 +0100 (18:52 +0100)]
y4m: Avoid some redundant strlen() calls

Henrik Gramner [Sun, 25 Oct 2015 17:15:10 +0100 (17:15 +0100)]
Simplify threadpool_wait

Henrik Gramner [Fri, 16 Oct 2015 18:05:34 +0100 (19:05 +0200)]
windows: Use native threads by default

--disable-win32thread can be passed as an argument to configure to compile
with pthreads, which was the old default behavior.

Henrik Gramner [Sun, 11 Oct 2015 21:32:11 +0100 (22:32 +0200)]
x86: Avoid some bypass delays and false dependencies

A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
between int and float domains, so try to avoid that if possible.

Henrik Gramner [Sun, 11 Oct 2015 21:32:03 +0100 (22:32 +0200)]
x86: Enable high bit-depth x264_coeff_last64_avx2_lzcnt

The function existed but was never enabled.

Geza Lore [Mon, 12 Oct 2015 13:13:42 +0100 (13:13 +0100)]
x86inc: Add debug symbols indicating sizes of compiled functions

Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they can get confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.

Currently only implemented for ELF.

Henrik Gramner [Fri, 16 Oct 2015 20:28:49 +0100 (21:28 +0200)]
x86inc: Avoid creating unnecessary local labels

The REP_RET workaround is only needed on old AMD cpus, and the labels clutter
up the symbol table and confuse debugging/profiling tools, so use EQU to
create SHN_ABS symbols instead of creating local labels. Furthermore, skip
the workaround completely in functions that definitely won't run on such cpus.

This patch doesn't modify any emitted instructions, and doesn't actually affect
x264 at all. It's only for other projects that use x86inc.asm without an
appropriate `strip` command in their buildsystem.

Note that EQU is just creating a local label when using nasm instead of yasm.
This is probably a bug, but at least it doesn't break anything.

Henrik Gramner [Thu, 15 Oct 2015 16:42:49 +0100 (17:42 +0200)]
x86inc: Simplify AUTO_REP_RET

cpuflags is never undefined any more, it's set to 0 instead.

Also fix an incorrect comment.

Henrik Gramner [Mon, 12 Oct 2015 20:55:11 +0100 (21:55 +0200)]
x86inc: Use more consistent indentation

Henrik Gramner [Mon, 12 Oct 2015 19:15:18 +0100 (20:15 +0200)]
x86inc: Preserve arguments when allocating stack space

When allocating stack space with a larger alignment than the known stack
alignment a temporary register is used for storing the stack pointer.
Ensure that this isn't one of the registers used for passing arguments.

Henrik Gramner [Sun, 17 Jan 2016 00:25:47 +0100 (00:25 +0100)]
x86inc: Improve FMA instruction handling

* Correctly handle FMA instructions with memory operands.
* Print a warning if FMA instructions are used without the correct cpuflag.
* Simplify the instantiation code.
* Clarify documentation.

Only the last operand in FMA3 instructions can be a memory operand. When
converting FMA4 instructions to FMA3 instructions we can utilize the fact
that multiply is a commutative operation and reorder operands if necessary
to ensure that a memory operand is used only as the last operand.

Henrik Gramner [Sun, 11 Oct 2015 21:31:53 +0100 (22:31 +0200)]
x86inc: Be more verbose in assertion failures

Henrik Gramner [Wed, 30 Sep 2015 22:17:00 +0100 (23:17 +0200)]
x86inc: Make cpuflag() and notcpuflag() return 0 or 1

Makes it possible to use them in arithmetic expressions.

Henrik Gramner [Fri, 30 Oct 2015 16:55:49 +0100 (16:55 +0100)]
encoder_open: Fix memory leak

Furthermore, the x264_analyse_prepare_costs() and x264_analyse_init_costs()
functions were only used in x264_encoder_open(), so move that entire section
of code to analyse.c as well to simplify things.

Janne Grunau [Wed, 18 Nov 2015 11:08:22 +0100 (11:08 +0100)]
arm: do not fill mc_weight*_neon tabs for HIGH_BIT_DEPTH

The asm is only for 8-bit and function prototypes reflect that. Avoids
numerous warnings with --bit-depth=9/10.

Janne Grunau [Tue, 13 Oct 2015 22:50:11 +0100 (23:50 +0200)]
arm: Eliminate text relocations in asm

Android 6 does not link shared libraries with text relocations.

Make the movrel macro position independent and add movrelx for indirect
loads of external symbols.

Move the function pointer table for the aligned memcpy variants to the section on Linux/Android.

Martin Storsjö [Thu, 15 Oct 2015 09:50:33 +0100 (11:50 +0300)]
arm: Don't assume alignment in mbtree_propagate_list_internal where it isn't provided

Janne Grunau [Tue, 13 Oct 2015 22:50:12 +0100 (23:50 +0200)]
arm: Fix checkasm register clobber check on iOS

r9 is a volatile register in the iOS ABI and will therefore not be
preserved by compiled functions like the luma motion compensation.

Add the symbol prefix to the puts() call and use blx since a switch
between arm and thumb mode might be required.

Anton Mitrofanov [Thu, 1 Oct 2015 00:02:16 +0200 (01:02 +0300)]
ppc: Add detection of AltiVec support for FreeBSD

Patch from FreeBSD ports.

Anton Mitrofanov [Mon, 28 Sep 2015 20:07:55 +0200 (21:07 +0300)]
Don't assume 16-byte stack alignment by default on x86-32

Some compilers, depending on the target OS, use 4-byte stack alignment by default.
Explicitly check known good compilers and specific options for stack alignment.

Anton Mitrofanov [Tue, 22 Sep 2015 20:33:07 +0200 (21:33 +0300)]
Fix a few static analyzer performance hints

Anton Mitrofanov [Tue, 22 Sep 2015 19:19:23 +0200 (20:19 +0300)]
Revise the row VBV algorithm

Anton Mitrofanov [Tue, 22 Sep 2015 18:26:25 +0200 (19:26 +0300)]
Fix high bit depth lookahead cost compensation algorithm

Now high bit depth VBV should act more like 8-bit depth one.

Anton Mitrofanov [Tue, 22 Sep 2015 18:05:52 +0200 (19:05 +0300)]
Correctly update the intra row predictor in B-frames

It was previously used but never updated from its initialization value.

Anton Mitrofanov [Tue, 22 Sep 2015 17:58:24 +0200 (18:58 +0300)]
Change the predictors update algorithm

Keep predictor offsets more stable. This should fix VBV misprediction in frames
with a large difference in complexity between the top and bottom parts.

Martin Storsjö [Thu, 3 Sep 2015 08:30:44 +0200 (09:30 +0300)]
arm: Implement x264_mbtree_propagate_{cost, list}_neon

The cost function could be simplified to avoid having to clobber
q4/q5, but this requires reordering instructions which increase
the total runtime.

checkasm timing             Cortex-A7  A8      A9
mbtree_propagate_cost_c     63702      155835  62829
mbtree_propagate_cost_neon  17199      10454   11106

mbtree_propagate_list_c     104203     108949  84532
mbtree_propagate_list_neon  82035      78348   60410

Martin Storsjö [Thu, 3 Sep 2015 08:30:43 +0200 (09:30 +0300)]
x86: Share the mbtree_propagate_list macro with aarch64

This avoids having to duplicate the same code for all architectures
that implement only the internal part of this function in assembler.

Martin Storsjö [Wed, 2 Sep 2015 21:39:51 +0200 (22:39 +0300)]
arm: Implement luma intra deblocking

checkasm timing             Cortex-A7  A8    A9
deblock_luma_intra[0]_c     5988       4653  4316
deblock_luma_intra[0]_neon  3103       2170  2128
deblock_luma_intra[1]_c     7119       5905  5347
deblock_luma_intra[1]_neon  2068       1381  1412

This includes extra optimizations by Janne Grunau.

Timings from a separate build, on Exynos 5422:

                            Cortex-A7  A15
deblock_luma_intra[0]_c     6627       3300
deblock_luma_intra[0]_neon  3059       1128
deblock_luma_intra[1]_c     7314       4128
deblock_luma_intra[1]_neon  2038       720

Martin Storsjö [Mon, 31 Aug 2015 21:40:31 +0200 (22:40 +0300)]
arm: Implement some neon 8x16c intra predict functions

checkasm timing               Cortex-A7  A8    A9
intra_predict_8x16c_dct_c     862        540   590
intra_predict_8x16c_dct_neon  608        511   657
intra_predict_8x16c_h_c       972        707   719
intra_predict_8x16c_h_neon    722        656   672
intra_predict_8x16c_p_c       10183      9819  8655
intra_predict_8x16c_p_neon    2622       1972  1983

Martin Storsjö [Thu, 27 Aug 2015 23:15:01 +0200 (00:15 +0300)]
arm: Implement x264_plane_copy_neon

checkasm timing  Cortex-A7  A8     A9
plane_copy_c     13124      10925  9106
plane_copy_neon  7349       5103   8945

Martin Storsjö [Fri, 28 Aug 2015 08:40:24 +0200 (09:40 +0300)]
checkasm: arm: Check register clobbering

Cast the function pointer to a different type signature, to
be able to use uint64_t as return type (instead of intptr_t) for
those calls that require it.

Use two separate functions, depending on whether neon is available.

Martin Storsjö [Thu, 13 Aug 2015 23:00:57 +0200 (00:00 +0300)]
checkasm: Try different widths for ssd_nv12

To test all codepaths in the aarch64 neon implementation, one at
the very least needs to test with width 8, 16, 24 and 32.

Jerome Duval [Fri, 13 Jun 2014 21:56:27 +0200 (19:56 +0000)]
Haiku support

Add Haiku as supported platform in configure.
Haiku has no nice() function, use the platform specific substitute instead.

Martin Storsjö [Tue, 25 Aug 2015 13:38:20 +0200 (14:38 +0300)]
checkasm: aarch64: Check register clobbering

Disable this on iOS, since it has got a slightly different ABI
for vararg parameters.

Martin Storsjö [Tue, 25 Aug 2015 22:36:45 +0200 (23:36 +0300)]
arm: Implement x264_decimate_score15/16/64_neon

checkasm timing        Cortex-A7  A8    A9
decimate_score15_c     764        736   535
decimate_score15_neon  487        494   453
decimate_score16_c     782        727   553
decimate_score16_neon  487        494   521
decimate_score64_c     2361       2597  2011
decimate_score64_neon  1017       802   785

Martin Storsjö [Tue, 25 Aug 2015 22:36:44 +0200 (23:36 +0300)]
arm: Implement chroma intra deblock

checkasm timing                      Cortex-A7  A8    A9
deblock_chroma_420_intra_mbaff_c     1469       1276  1181
deblock_chroma_420_intra_mbaff_neon  981        717   644
deblock_chroma_intra[1]_c            2954       2402  2321
deblock_chroma_intra[1]_neon         947        581   575
deblock_h_chroma_420_intra_c         2859       2509  2264
deblock_h_chroma_420_intra_neon      1480       1119  1028
deblock_h_chroma_422_intra_c         6211       5030  4792
deblock_h_chroma_422_intra_neon      2894       1990  2077

Martin Storsjö [Tue, 25 Aug 2015 13:38:17 +0200 (14:38 +0300)]
arm: Implement x264_pixel_sa8d_satd_16x16_neon

This requires spilling some registers to the stack,
contrary to the aarch64 version.

checkasm timing                Cortex-A7  A8    A9
sa8d_satd_16x16_neon           12936      6365  7492
sa8d_satd_16x16_separate_neon  14841      6605  8324

Martin Storsjö [Tue, 25 Aug 2015 13:38:16 +0200 (14:38 +0300)]
arm: Implement x264_deblock_h_chroma_mbaff_neon

checkasm timing                Cortex-A7  A8    A9
deblock_chroma_420_mbaff_c     1944       1706  1526
deblock_chroma_420_mbaff_neon  1210       873   865

Martin Storsjö [Tue, 25 Aug 2015 13:38:15 +0200 (14:38 +0300)]
arm: Implement x264_deblock_h_chroma_422_neon

checkasm timing            Cortex-A7  A8    A9
deblock_h_chroma_422_c     6953       6269  5145
deblock_h_chroma_422_neon  3905       2569  2551

Martin Storsjö [Tue, 25 Aug 2015 13:38:14 +0200 (14:38 +0300)]
arm: Implement integral_init4/8h/v_neon

checkasm timing       Cortex-A7  A8     A9
integral_init4h_c     10466      8590   6161
integral_init4h_neon  3021       1494   1800
integral_init4v_c     16250      13590  13628
integral_init4v_neon  3473       2073   3291
integral_init8h_c     10100      8275   5705
integral_init8h_neon  4403       2344   2751
integral_init8v_c     6403       4632   4999
integral_init8v_neon  1184       783    1306

Martin Storsjö [Tue, 25 Aug 2015 13:38:13 +0200 (14:38 +0300)]
arm: Implement x264_denoise_dct_neon

checkasm timing   Cortex-A7  A8    A9
denoise_dct_c     6604       5510  5858
denoise_dct_neon  1774       1139  1614

Martin Storsjö [Tue, 25 Aug 2015 13:38:12 +0200 (14:38 +0300)]
arm: Add x264_nal_escape_neon

checkasm timing  Cortex-A7      A8      A9
nal_escape_c        852758  879566  655497
nal_escape_neon     376831  450678  371673

Martin Storsjö [Tue, 25 Aug 2015 13:38:11 +0200 (14:38 +0300)]
arm: Add neon versions of vsad, asd8 and ssd_nv12_core

These are straight translations of the aarch64 versions.

checkasm timing  Cortex-A7      A8      A9
vsad_c               16234   10984    9850
vsad_neon             2132    1020     789

asd8_c                5859    3561    3543
asd8_neon             1407    1279    1250

ssd_nv12_c          608096  591072  426285
ssd_nv12_neon        72752   33549   41347

Martin Storsjö [Tue, 25 Aug 2015 13:38:10 +0200 (14:38 +0300)]
checkasm: Check the right output range for integral_initXh

These functions write their output into sum+stride, while we previously
only checked [0..stride-8] within the sum array.

This catches the previously broken aarch64 version of these functions.

Also check up until stride-4 elements for init4h.

Janne Grunau [Thu, 20 Aug 2015 13:55:54 +0200 (13:55 +0200)]
aarch64: Skip deblocking in x264_deblock_h_chroma_422_neon

If the parameters (alpha, beta, tc0[]) indicated that the deblocking
should have been skipped, every 2nd chroma line would have been deblocked.

deblock_h_chroma_422_neon: 2259 (before)
deblock_h_chroma_422_neon: 2192 (after)

Janne Grunau [Mon, 17 Aug 2015 16:39:20 +0200 (16:39 +0200)]
aarch64: Optimize various intra_predict asm functions

Make them at least as fast as the compiled C version (tested on
cortex-a53 vs. gcc 4.9.2).

                             C   NEON (before)   NEON (after)
intra_predict_4x4_dc:      260             335            260
intra_predict_4x4_dct:     210             265            200
intra_predict_8x8c_dc:     497             548            493
intra_predict_8x8c_v:      232             309            179  (arm64)
intra_predict_8x16c_dc:    795             830            790

Janne Grunau [Tue, 18 Aug 2015 10:25:10 +0200 (10:25 +0200)]
aarch64: Faster intra_predict_4x4_h

Use multiplication with 0x01010101 for splats.

On a cortex-a53:
                      gcc 4.9.2  llvm 3.6  neon (before)  neon (after)
intra_predict_4x4_h:        162       147        160/155       139/135
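
The splat-by-multiplication idea can be shown in plain C. This is only a sketch of the arithmetic, not the aarch64 assembly from the commit; the helper name splat_u8 is made up for illustration.

```c
#include <stdint.h>

/* Replicate one byte into all four bytes of a 32-bit word.
 * Multiplying by 0x01010101 adds copies of the byte at bit offsets
 * 0, 8, 16 and 24; each copy lands in its own byte lane, so no
 * carries cross between lanes. */
static inline uint32_t splat_u8(uint8_t v)
{
    return (uint32_t)v * 0x01010101u;
}
```

For example, splat_u8(0xAB) yields 0xABABABAB, which can then be stored as four identical pixels of a 4x4 row.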

Janne Grunau [Tue, 18 Aug 2015 10:25:09 +0200 (10:25 +0200)]
aarch64: Fix coeff_level_run* macros with LLVM's assembler

LLVM's integrated assembler does not treat symbols as integer constants.

Janne Grunau [Tue, 18 Aug 2015 10:25:08 +0200 (10:25 +0200)]
aarch64: Remove commas LLVM's assembler complains about

Martin Storsjö [Thu, 13 Aug 2015 22:59:31 +0200 (23:59 +0300)]
arm: Implement x264_sub8x16_dct_dc_neon

checkasm timing      Cortex-A7      A8      A9
sub8x16_dct_dc_c          6386    3901    4080
sub8x16_dct_dc_neon       1491     698     917

Martin Storsjö [Thu, 13 Aug 2015 22:59:28 +0200 (23:59 +0300)]
arm: Optimize x264_deblock_h_chroma_neon

Shuffle both chroma components together as a 16 bit unit, and
don't write the unchanged columns (like in x264_deblock_h_luma_neon
and in the aarch64 version of the function).

This causes a minor slowdown for x264_deblock_v_chroma_neon, but
it is negligible compared to the speedup.

checkasm timing            Cortex-A7      A8      A9
deblock_chroma[1]_c             4817    4057    3601
deblock_chroma[1]_neon          1249     716     817  (before)
deblock_chroma[1]_neon          1249     766     845  (after)

deblock_h_chroma_420_c          3699    3275    2830
deblock_h_chroma_420_neon       2068    1414    1400  (before)
deblock_h_chroma_420_neon       1838    1355    1291  (after)

Martin Storsjö [Thu, 13 Aug 2015 22:59:27 +0200 (23:59 +0300)]
aarch64: Remove leftover commented out code

Martin Storsjö [Thu, 13 Aug 2015 22:59:26 +0200 (23:59 +0300)]
aarch64: Simplify the decimate_score functions

After doing a left shift by the number of bits returned by clz,
only bits set to zero can be shifted out, so if the register
was nonzero to start with (which is checked), it can't become
zero here.
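
The reasoning above can be checked with a small C sketch. It assumes a GCC or Clang compiler for the __builtin_clzll builtin; normalize_mask is a made-up name and not a function from x264.

```c
#include <stdint.h>

/* Shift a nonzero 64-bit mask left by its count of leading zeros.
 * __builtin_clzll is undefined for 0, so the caller must test for
 * zero beforehand, which is the check the commit message refers to. */
static inline uint64_t normalize_mask(uint64_t mask)
{
    int lz = __builtin_clzll(mask); /* number of leading zero bits */
    return mask << lz;              /* only zero bits are shifted out */
}
```

Because the highest set bit always ends up in bit 63, the result is never zero and a second zero test after the shift is redundant.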

Martin Storsjö [Thu, 13 Aug 2015 22:59:25 +0200 (23:59 +0300)]
arm: Use aligned loads in x264_coeff_last15_neon

After subtracting 2, the pointer will be aligned.

checkasm timing    Cortex-A7      A8      A9
coeff_last15_c           423     375     230
coeff_last15_neon        350     420     404  (before)
coeff_last15_neon        350     400     394  (after)

Martin Storsjö [Thu, 13 Aug 2015 22:59:24 +0200 (23:59 +0300)]
arm: Simplify x264_predict_8x8c_p_neon

This gets rid of a few unnecessary (and confusing) steps in
calculating the increment to i00.

checkasm timing            Cortex-A7      A8      A9
intra_predict_8x8c_p_c          5525    4732    4755
intra_predict_8x8c_p_neon       1719    1140    1262  (before)
intra_predict_8x8c_p_neon       1663    1142    1255  (after)

Vittorio Giovara [Tue, 15 Sep 2015 15:40:14 +0200 (15:40 +0200)]
lavf: Use the prefixed name for pixel format enum

Janne Grunau [Thu, 3 Sep 2015 00:21:58 +0200 (00:21 +0200)]
aarch64: fix x264_mbtree_propagate_cost_neon

The branch condition caused the loop to execute one time more than
intended. Detected by a memory corruption on arm with the 1 to 1 port of
the function.

Martin Storsjö [Thu, 13 Aug 2015 22:59:22 +0200 (23:59 +0300)]
aarch64: Fix integral_init4/8h_neon

The stride is the number of uint16_t elements and thus needs
to be shifted.

This issue had slipped unnoticed since checkasm didn't actually
verify the output of these functions.
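
As a rough illustration of the stride issue: C pointer arithmetic on uint16_t scales by the element size implicitly, while assembly working on byte addresses has to shift the stride itself. The helper names below are hypothetical and not taken from x264.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a stride counted in uint16_t elements into a byte offset. */
static inline size_t stride_bytes_u16(size_t stride_elems)
{
    return stride_elems << 1; /* sizeof(uint16_t) == 2 */
}

/* Return the row that follows `row`, computed through a byte offset
 * the way hand-written assembly would do it. */
static inline uint16_t *next_row_u16(uint16_t *row, size_t stride_elems)
{
    return (uint16_t *)((uint8_t *)row + stride_bytes_u16(stride_elems));
}
```

Forgetting the shift lands halfway into the intended row offset, a mistake that goes unnoticed when the test harness does not compare the output.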

Henrik Gramner [Thu, 27 Aug 2015 19:53:00 +0200 (19:53 +0200)]
x86: Fix integral_init4/8h_avx2

The AVX2 implementation was using the wrong offsets. It went undetected due to
the checkasm test being incorrect.

Mark Webster [Wed, 5 Aug 2015 05:28:17 +0200 (04:28 +0100)]
Simplify inclusion of x264.h in C++ projects

Name all structs to support forward declarations.
Add a conditional extern "C" wrapper in x264.h itself instead of having to
specify it in every location where it's included.
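
A header following this pattern might look roughly like the sketch below. The demo_param_t type and demo_param_pixels function are invented for illustration and are not part of x264.h.

```c
/* The wrapper is only active when the header is compiled as C++,
 * so C++ callers no longer need their own extern "C" block. */
#ifdef __cplusplus
extern "C" {
#endif

/* Naming the struct (not just the typedef) allows other headers to
 * forward-declare it as "struct demo_param_t". */
typedef struct demo_param_t
{
    int width;
    int height;
} demo_param_t;

static inline int demo_param_pixels(const demo_param_t *p)
{
    return p->width * p->height;
}

#ifdef __cplusplus
}
#endif
```

With an anonymous struct behind the typedef, a forward declaration in another header would have nothing to refer to.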

Henrik Gramner [Sun, 16 Aug 2015 21:59:26 +0200 (21:59 +0200)]
checkasm: Properly save rdx/edx in checkasm_call() on x86

If the return value doesn't fit in a single register rdx/edx can in some
cases be used in addition to rax/eax.

Doesn't affect any of the existing checkasm tests but it's more correct
behavior and it might be useful in the future.

Henrik Gramner [Tue, 11 Aug 2015 17:19:35 +0200 (17:19 +0200)]
x86: Enable SSE2 by default on x86-32

It makes more sense to tune the defaults to benefit the vast majority of users.

Anyone still using a Pentium III for video encoding is of course free to
explicitly set different flags when compiling.

Henrik Gramner [Mon, 10 Aug 2015 22:30:21 +0200 (22:30 +0200)]
msvs/icl: Improve default CFLAGS

Use -fp:fast as a substitute for -ffast-math.
Increase warning level from -W0 to -W1 (the default setting).
Disable -GS (stack cookies) on MSVS. It's disabled by default on ICL.

Henrik Gramner [Wed, 12 Aug 2015 22:23:31 +0200 (22:23 +0200)]
Use a relative $SRCPATH for out-of-tree builds

Fixes out-of-tree MSVS builds on Cygwin.

Henrik Gramner [Sat, 8 Aug 2015 22:26:38 +0200 (22:26 +0200)]
cygwin: Enable MSVS support

`cl -showIncludes` creates absolute Windows paths for some files; attempt
to convert those to Unix paths.

Use relative paths for dependencies located in or below the working directory
in order to mimic the behavior of gcc and to make the paths more readable.

Make the dependency generation script a bit more robust in general.

Henrik Gramner [Sat, 8 Aug 2015 18:34:21 +0200 (18:34 +0200)]
Minor fixes

Henrik Gramner [Sat, 8 Aug 2015 12:21:54 +0200 (12:21 +0200)]
Simplify

Also remove some non-POSIX syntax and improve robustness.

As a bonus the script now runs about 2-3 times faster.

`git rev-list --count` could be used to simplify things even further,
but that functionality was added in git 1.7.2 so keep `wc -l` for now
to maintain compatibility with older git versions.

장영훈 [Fri, 7 Aug 2015 07:43:24 +0200 (14:43 +0900)]
msvs: Fix cl detection in non-English environments

Henrik Gramner [Mon, 3 Aug 2015 21:05:11 +0200 (21:05 +0200)]
x86inc: Sync minor changes from ffmpeg/libav

Henrik Gramner [Wed, 29 Jul 2015 19:30:52 +0200 (19:30 +0200)]
matroska: Add comments for the remaining element names

Henrik Gramner [Wed, 29 Jul 2015 19:30:41 +0200 (19:30 +0200)]
Silence various static analyzer warnings

Those are false positives, but it doesn't hurt to get rid of them.

Henrik Gramner [Sun, 26 Jul 2015 23:13:29 +0200 (23:13 +0200)]
mingw: Enable the tsaware linker flag

Avoids an irrelevant compatibility layer in Terminal Services environments.

Henrik Gramner [Sun, 26 Jul 2015 23:13:26 +0200 (23:13 +0200)]
msvs: Don't redefine snprintf for VS2015

Visual Studio 2015 has a proper snprintf implementation.
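
A guard of this kind typically looks like the sketch below. It assumes the usual compiler version macro (_MSC_VER is 1900 for Visual Studio 2015); the exact form used in x264 may differ, and format_rev is a made-up helper.

```c
#include <stdio.h>

/* Before Visual Studio 2015 there is no C99 snprintf, so it is commonly
 * mapped to _snprintf (which does not guarantee NUL termination).
 * From _MSC_VER 1900 on, the real snprintf is used untouched. */
#if defined(_MSC_VER) && _MSC_VER < 1900
#define snprintf _snprintf
#endif

/* Format a revision number such as "r2705" into buf. */
static int format_rev(char *buf, size_t size, int rev)
{
    return snprintf(buf, size, "r%d", rev);
}
```

On non-Microsoft compilers the macro is never defined, so the code compiles against the standard library function directly.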

Henrik Gramner [Sun, 26 Jul 2015 23:13:19 +0200 (23:13 +0200)]
msvs: Prefer link.exe from the same directory as cl.exe

/usr/bin/link from coreutils may be located before the MSVS linker in $PATH
which causes linking to fail due to using the wrong binary.

Henrik Gramner [Mon, 27 Jul 2015 00:10:00 +0200 (00:10 +0200)]
frame_dump: check fseek() return value

Henrik Gramner [Mon, 27 Jul 2015 00:08:38 +0200 (00:08 +0200)]
x264_vfprintf: use va_copy

It's undefined behavior to use the same va_list twice.

This most likely didn't cause any issues in practice since the string would
have to be larger than 4 KiB to trigger the fallback path.

Use a workaround for ICL, as it doesn't define va_copy even for C99.
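
The va_copy pattern can be sketched as follows. This is not x264_vfprintf itself: format_alloc is a hypothetical helper that measures first and allocates afterwards, whereas the commit describes a fixed buffer with a fallback path.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a freshly allocated string. The argument list is walked
 * twice, so the first pass runs on a copy made with va_copy. */
static char *format_alloc(const char *fmt, ...)
{
    va_list args, args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);

    /* First pass: measure the formatted length (consumes args_copy). */
    int len = vsnprintf(NULL, 0, fmt, args_copy);
    va_end(args_copy);

    char *out = NULL;
    if (len >= 0 && (out = malloc((size_t)len + 1)) != NULL)
        vsnprintf(out, (size_t)len + 1, fmt, args); /* second pass */

    va_end(args);
    return out; /* caller frees; NULL on error */
}
```

Passing the same va_list to both vsnprintf calls would compile, but the second call would read from an already consumed list.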

Henrik Gramner [Mon, 27 Jul 2015 00:08:31 +0200 (00:08 +0200)]
param_parse: Fix framerate rounding issues

Marcin Juszkiewicz [Mon, 1 Jun 2015 11:24:45 +0200 (11:24 +0200)]
aarch64: Remove broken CFLAGS in configure

GCC doesn't have an "-arch" switch, but works when that entire line is removed.

Rong Yan [Mon, 20 Jul 2015 10:34:20 +0200 (03:34 -0500)]
ppc: Add little-endian PowerPC support

Rishikesh More [Thu, 18 Jun 2015 14:18:46 +0200 (17:48 +0530)]
mips: MSA quant optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:45 +0200 (17:48 +0530)]
mips: MSA predict optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:44 +0200 (17:48 +0530)]
mips: MSA pixel optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:43 +0200 (17:48 +0530)]
mips: MSA deblock optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:42 +0200 (17:48 +0530)]
mips: MSA dct optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:40 +0200 (17:48 +0530)]
mips: MSA mc optimizations

Signed-off-by: Rishikesh More <>

Rishikesh More [Thu, 18 Jun 2015 14:18:38 +0200 (17:48 +0530)]
mips: Common MSA macros

Add macros for load/store, slide, shift, transpose and basic arithmetic
operations required by subsequent patches.

Signed-off-by: Rishikesh More <>

Rishikesh More [Tue, 12 May 2015 16:08:09 +0200 (19:38 +0530)]
mips: Add MSA support to checkasm

Signed-off-by: Rishikesh More <>

Kaustubh Raste [Fri, 17 Apr 2015 14:08:58 +0200 (17:38 +0530)]
mips: Initial MSA support

MSA is the MIPS SIMD Architecture.

Add X264_CPU_MSA define.
Update configure to detect MIPS platform and set flags.
CPU-specific gcc options are expected through --extra-cflags.

Sample command line for mips32r5:
./configure --host=mipsel-linux-gnu --cross-prefix=<TOOLCHAIN>/mips-mti-linux-gnu-
--extra-cflags="-EL -mips32r5 -msched-weight -mload-store-pairs"

Signed-off-by: Kaustubh Raste <>

Anton Mitrofanov [Thu, 16 Jul 2015 23:22:29 +0200 (00:22 +0300)]
Limit autodetection of threads number according to the source height

Anton Mitrofanov [Thu, 16 Jul 2015 18:04:59 +0200 (19:04 +0300)]
Fine-tune of frame's size predictors at ratecontrol start

This is an attempt to improve VBV at the start of a video with a lot of
threads, which delay feedback for the predictors.

Anton Mitrofanov [Thu, 16 Jul 2015 15:15:56 +0200 (16:15 +0300)]
Use forced frame types in slicetype analysis

This should improve MBTree and VBV when a lot of forced frame types are used.

Henrik Gramner [Mon, 1 Dec 2014 23:05:42 +0200 (22:05 +0100)]
x86: SSSE3 and AVX2 implementations of plane_copy_swap

For NV21 input.

Yu Xiaolei [Fri, 6 Jun 2014 10:05:27 +0200 (16:05 +0800)]
NV21 input support

Eliminates an extra copy when encoding Android camera preview images.

Checkasm test by Janne Grunau.
ARM assembly with improvements from Janne Grunau.
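
NV21 stores its interleaved chroma as V,U pairs where NV12 uses U,V, so reading it comes down to a copy that swaps each byte pair. A plain C sketch of such a copy is shown below; copy_swap_pairs is a made-up name and the signature is an assumption, not the plane_copy_swap prototype from x264.

```c
#include <stdint.h>

/* Copy an interleaved chroma plane while swapping each byte pair,
 * turning V,U,V,U,... into U,V,U,V,... (or the reverse).
 * `width` is the number of chroma pairs per row; strides are in bytes. */
static void copy_swap_pairs(uint8_t *dst, intptr_t dst_stride,
                            const uint8_t *src, intptr_t src_stride,
                            int width, int height)
{
    for (int y = 0; y < height; y++, dst += dst_stride, src += src_stride)
        for (int x = 0; x < 2 * width; x += 2) {
            dst[x]     = src[x + 1];
            dst[x + 1] = src[x];
        }
}
```

Doing the swap inside the plane copy that happens anyway is what avoids a separate conversion pass over the camera buffer.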

Henrik Gramner [Tue, 23 Jun 2015 17:00:47 +0200 (17:00 +0200)]
deblock: Write combining

Henrik Gramner [Tue, 23 Jun 2015 14:59:59 +0200 (14:59 +0200)]
Get rid of some tabs and trailing whitespaces

Henrik Gramner [Sat, 23 May 2015 19:44:16 +0200 (19:44 +0200)]
x86: Experimental nasm support

Enables the use of nasm as an alternative to yasm.

Note that nasm cannot assemble x264 with PIC enabled since it currently doesn't
support [symbol-$$] addressing which is used extensively by x264's PIC code.
This includes all 64-bit Windows and 64-bit OS X builds, even non-shared.

For the above reason nasm is currently intentionally not auto-detected, instead
the assembler must be explicitly specified using "AS=nasm ./configure".

Also drop -O2 from ASFLAGS since it's simply ignored anyway.

Timothy Gu [Tue, 26 May 2015 19:12:42 +0200 (19:12 +0200)]
x86inc: Prevent warnings when using `struc` and `endstruc`

struc and endstruc attempt to revert to the previous section state set by
the SECTION macro.

Use the primitive [SECTION] directive instead of the SECTION macro for the
.note.GNU-stack section to prevent it from being emitted again during endstruc.

Henrik Gramner [Wed, 27 May 2015 21:38:14 +0200 (21:38 +0200)]
x86inc: Drop SECTION_TEXT macro

The .text section is already 16-byte aligned by default on all supported
platforms so `SECTION_TEXT` isn't any different from `SECTION .text`.

Henrik Gramner [Sat, 23 May 2015 13:38:05 +0200 (13:38 +0200)]
x86inc: Disable vpbroadcastq workaround in newer yasm versions

The bug was fixed in 1.3.0, so only perform the workaround in earlier versions.

Henrik Gramner [Sun, 24 May 2015 22:57:00 +0200 (22:57 +0200)]
Prefer Unicode versions of Windows API calls

Just for consistency, doesn't affect behavior.

Henrik Gramner [Sun, 24 May 2015 23:21:20 +0200 (23:21 +0200)]
Get rid of fPIC warnings when compiling a shared library on Windows

PIC is always enabled when compiling for Windows so gcc complains when using
-fPIC since it doesn't do anything.

Henrik Gramner [Sat, 25 Jul 2015 22:42:59 +0200 (22:42 +0200)]
matroska: Write the correct DocTypeVersion when using frame-packing

The StereoMode element is only valid with DocTypeVersion 3 or higher.

Anton Mitrofanov [Fri, 24 Jul 2015 23:21:52 +0200 (00:21 +0300)]
dump_yuv: Fix file handle leak

Anton Mitrofanov [Fri, 24 Jul 2015 23:20:47 +0200 (00:20 +0300)]
mp4: Fix file handle leak

Henrik Gramner [Wed, 24 Jun 2015 00:40:45 +0200 (00:40 +0200)]
flv: Check fseek() and fwrite() return values

Henrik Gramner [Wed, 24 Jun 2015 00:22:56 +0200 (00:22 +0200)]
flv: Fix memory and file handle leaks

Henrik Gramner [Wed, 24 Jun 2015 01:23:35 +0200 (01:23 +0200)]
avs: Fix file handle leak

Henrik Gramner [Tue, 23 Jun 2015 13:38:02 +0200 (13:38 +0200)]
matroska: Fix memory leak

Henrik Gramner [Tue, 23 Jun 2015 13:24:29 +0200 (13:24 +0200)]
rdo: Fix potential CAVLC overflow issues

Henrik Gramner [Tue, 23 Jun 2015 22:08:35 +0200 (22:08 +0200)]
slurp_file: Various minor bug fixes

* Fix unsigned <= 0 check.
* Add additional size sanity check on 32-bit systems.
* Don't read uninitialized data if fread() fails.
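
The kinds of checks listed above can be illustrated with a generic whole-file reader. This is a sketch under the assumption of a standard C library only; read_whole_file is a made-up function and not the slurp_file code from x264.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a NUL-terminated heap buffer.
 * Returns NULL on any error or if the file is empty. */
static char *read_whole_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f); /* signed, so the "<= 0" test is meaningful */
        if (size > 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);
            if (buf) {
                size_t got = fread(buf, 1, (size_t)size, f);
                if (got != (size_t)size) {
                    free(buf); /* short read: never hand back garbage */
                    buf = NULL;
                } else {
                    buf[size] = '\0';
                }
            }
        }
    }
    fclose(f);
    return buf; /* caller frees */
}
```

Storing the size in an unsigned type before the comparison would make the error value of ftell look like a huge valid size, which is the class of bug the first bullet refers to.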

Henrik Gramner [Tue, 23 Jun 2015 22:47:53 +0200 (22:47 +0200)]
param_parse: Check strdup() return value

Henrik Gramner [Tue, 23 Jun 2015 15:38:16 +0200 (15:38 +0200)]
param_parse: Fix memory leak

Anton Mitrofanov [Fri, 19 Jun 2015 15:01:12 +0200 (16:01 +0300)]
Add FreeBSD's stdint.h header guard to allowed list

Patch written by Koop Mast <>

Henrik Gramner [Fri, 22 May 2015 19:23:33 +0200 (19:23 +0200)]
x86: Prevent overread of src in plane_copy_interleave

Could only occur in 4:2:2 with height == 1.

Also enable asm for inputs with different U/V strides as long as the strides
have identical signs.

Anton Mitrofanov [Wed, 20 May 2015 22:10:20 +0200 (23:10 +0300)]
checkasm: Fix incorrect memcmp size for ARM architecture

Anton Mitrofanov [Sun, 26 Apr 2015 19:51:05 +0200 (20:51 +0300)]
Fix possible use of uninitialized MVs in lookahead analysis for B-frames

Anton Mitrofanov [Tue, 21 Apr 2015 22:08:19 +0200 (23:08 +0300)]
Catch incorrect usage of libx264 API for delayed frames flushing

Anton Mitrofanov [Sat, 7 Mar 2015 22:00:09 +0200 (23:00 +0300)]
Fix detection of system libx264 configuration

Anton Mitrofanov [Mon, 23 Feb 2015 12:23:18 +0100 (14:23 +0300)]
Cosmetic changes

Anton Mitrofanov [Wed, 31 Dec 2014 00:15:05 +0100 (02:15 +0300)]
Update configure for auto detection of system libx264 configuration

Anton Mitrofanov [Tue, 3 Feb 2015 12:51:28 +0100 (14:51 +0300)]
Add tile format frame packing value

Defined in 2014-02 edition.

Anton Mitrofanov [Tue, 3 Feb 2015 11:39:14 +0100 (13:39 +0300)]
Stricter validation of crop-rect values

Vittorio Giovara [Tue, 20 Jan 2015 17:15:56 +0100 (16:15 +0000)]
Add mono frame packing value

Defined in 2013-04 edition.

Vittorio Giovara [Tue, 20 Jan 2015 16:57:41 +0100 (15:57 +0000)]
Validate frame packing value instead of clipping

Christophe Gisquet [Tue, 3 Feb 2015 20:40:41 +0100 (20:40 +0100)]
x86inc: Correctly warn on use of SSE2 instructions in SSE functions

SSE2 instructions that are XMM-implementations of pre-existing MMX/MMX2
instructions did not issue warnings when used in SSE functions. Handle
it by also checking the register type when such instructions are used.

Christophe Gisquet [Tue, 3 Feb 2015 18:02:30 +0100 (18:02 +0100)]
x86inc: Fix instantiation of YMM registers

Vittorio Giovara [Tue, 20 Jan 2015 17:28:54 +0100 (16:28 +0000)]
matroska: Correctly write display width and height in stereo mode

According to the specifications, when stereo mode is set, these values
represent the single view size.

Kieran Kunhya [Tue, 20 Jan 2015 16:38:00 +0100 (09:38 -0600)]
Use POC type 0 for AVC-Intra

Based on a patch from Capella Systems

Anton Mitrofanov [Sat, 3 Jan 2015 13:46:19 +0100 (15:46 +0300)]
Fix ARCH variable name conflict with BSD ports ( read-only variable

Anton Mitrofanov [Sat, 27 Dec 2014 18:35:39 +0100 (20:35 +0300)]
Fix negative percentages in final stats output

They were caused by integer overflow when encoding long UHD video.
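
How a percentage can go negative is easy to reproduce: with 32-bit ints, multiplying a count by 100 overflows once the count exceeds roughly 21 million. Which counters were affected in x264 is not stated here, so the sketch below only shows the general remedy of widening to 64 bits; percent_of is a made-up helper.

```c
#include <stdint.h>

/* Integer percentage of `part` in `total`, computed in 64-bit
 * arithmetic so that `part * 100` cannot overflow for any count
 * that fits in 32 bits. Returns 0 when total is 0. */
static inline int percent_of(int64_t part, int64_t total)
{
    return total ? (int)(part * 100 / total) : 0;
}
```

Long UHD encodes reach such counts quickly because every frame contributes tens of thousands of macroblocks.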

Anton Mitrofanov [Sat, 3 Jan 2015 21:35:23 +0100 (23:35 +0300)]
Bump dates to 2015

Anton Mitrofanov [Mon, 15 Dec 2014 16:49:23 +0100 (18:49 +0300)]
x86: Update intel compiler cpu dispatcher override for new versions of ICC/ICL

Anton Mitrofanov [Tue, 6 Sep 2011 18:53:29 +0100 (21:53 +0400)]
New AQ mode: auto-variance AQ with bias to dark scenes

Also known as --aq-mode 3 or auto-variance AQ modification.

Anton Mitrofanov [Wed, 29 Aug 2012 00:02:27 +0100 (03:02 +0400)]
Improve HRD conformance

Henrik Gramner [Fri, 28 Nov 2014 23:24:56 +0100 (23:24 +0100)]
x86: SSE and AVX implementations of plane_copy

Also remove the MMX2 implementation and fix src overread for height == 1.

Anton Mitrofanov [Mon, 29 Sep 2014 20:26:19 +0100 (23:26 +0400)]
Update to the latest version of from

Contributions by Janne Grunau, Martin Storsjo, Mans Rullgard, David Conrad, Martin Aumuller and others

Janne Grunau [Wed, 19 Nov 2014 00:33:55 +0100 (00:33 +0100)]
aarch64: cabac_encode_{decision,bypass,terminal}_asm

benchmarks on a Nexus 9 (nvidia denver):
101.3 cycles in x264_cabac_encode_decision_c, 67105369 runs, 3495 skips
97.3 cycles in x264_cabac_encode_decision_asm, 67105493 runs, 3371 skips
132.8 cycles in x264_cabac_encode_terminal_c, 1046950 runs, 1626 skips
116.1 cycles in x264_cabac_encode_terminal_asm, 1048424 runs, 152 skips
92.4 cycles in x264_cabac_encode_bypass_c, 16776192 runs, 1024 skips
89.6 cycles in x264_cabac_encode_bypass_asm, 16776453 runs, 763 skips

Cycle counts are not as stable as one would like. The dynamic code
optimisation seems to produce different results for small changes in a
binary. Repeated runs with the same binary produce stable results
though (ignoring the first run).

Janne Grunau [Thu, 6 Nov 2014 09:20:17 +0100 (09:20 +0100)]
checkasm: add cycle counter read for aarch64

Needs kernel support since user space access to the cycle counter is not
allowed on all available AArch64 systems (Android 5 and iOS).

Janne Grunau [Wed, 5 Nov 2014 11:35:13 +0100 (11:35 +0100)]
aarch64: nal_escape_neon

3-4 times faster.

Janne Grunau [Fri, 31 Oct 2014 14:49:04 +0100 (14:49 +0100)]
aarch64: {plane_copy,memcpy_aligned,memzero_aligned}_neon

2-3 times faster than C.

Janne Grunau [Wed, 29 Oct 2014 18:17:48 +0100 (18:17 +0100)]
aarch64: x264_mbtree_propagate_{cost,list}_neon

x264_mbtree_propagate_cost_neon is ~7 times faster.
x264_mbtree_propagate_list_neon is 33% faster.

Janne Grunau [Tue, 21 Oct 2014 14:18:49 +0100 (15:18 +0200)]
aarch64: x264_denoise_dct_neon

3.5 times faster.

Janne Grunau [Mon, 20 Oct 2014 12:12:14 +0100 (13:12 +0200)]
aarch64: x264_coeff_level_run{4,8,15,16}

All functions ~33% faster.

Janne Grunau [Tue, 14 Oct 2014 18:20:52 +0100 (19:20 +0200)]
aarch64: NEON asm for intra luma deblocking

deblock_luma_intra[0]_neon is 2 times faster,
deblock_luma_intra[1]_neon is ~4 times faster.

Janne Grunau [Mon, 13 Oct 2014 16:29:22 +0100 (17:29 +0200)]
aarch64: x264_deblock_h_chroma_422_neon

deblock_h_chroma_422 2.5 times faster

Janne Grunau [Mon, 13 Oct 2014 11:43:50 +0100 (12:43 +0200)]
aarch64: x264_deblock_h_chroma_mbaff_neon

deblock_chroma_420_mbaff_neon 2 times faster

Janne Grunau [Fri, 10 Oct 2014 09:29:15 +0100 (10:29 +0200)]
aarch64: NEON asm for intra chroma deblocking

deblock_h_chroma_420_intra, deblock_h_chroma_422_intra and
x264_deblock_h_chroma_intra_mbaff_neon are ~3 times faster.
deblock_chroma_intra[1] is ~4 times faster than C.

Janne Grunau [Tue, 2 Sep 2014 09:27:22 +0100 (10:27 +0200)]
aarch64: add myself as author to aarch64/mc.h

Janne Grunau [Thu, 14 Aug 2014 14:22:50 +0100 (14:22 +0100)]
aarch64: NEON asm for integral init

integral_init4h_neon and integral_init8h_neon are 3-4 times faster than
C. integral_init8v_neon is 6 times faster and integral_init4v_neon is 10
times faster.

Janne Grunau [Wed, 13 Aug 2014 13:30:53 +0100 (13:30 +0100)]
aarch64: NEON asm for 8x16c intra prediction

Between 10% and 40% faster than C.

Janne Grunau [Tue, 12 Aug 2014 16:26:10 +0100 (17:26 +0200)]
aarch64: NEON asm for decimate_score

decimate_score15 and 16 are 60% faster, decimate_score64 is 4 times
faster than C.

Janne Grunau [Fri, 8 Aug 2014 11:19:35 +0100 (11:19 +0100)]
aarch64: implement x264_sub8x16_dct_dc_neon

4 times faster than C.

Janne Grunau [Thu, 7 Aug 2014 18:46:07 +0100 (19:46 +0200)]
aarch64: implement x264_pixel_asd8_neon

7 times faster than C.

Janne Grunau [Thu, 7 Aug 2014 15:49:12 +0100 (16:49 +0200)]
aarch64: NEON asm for 4x16 sad, satd and ssd

pixel_sad_4x16_neon: 33% faster than C
pixel_satd_4x16_neon: 5 times faster
pixel_ssd_4x16_neon: 4 times faster

Janne Grunau [Wed, 30 Jul 2014 15:48:25 +0100 (15:48 +0100)]
aarch64: implement x264_pixel_ssd_nv12_core_neon

13 times faster than C.

Janne Grunau [Tue, 29 Jul 2014 18:26:11 +0100 (18:26 +0100)]
aarch64: implement x264_pixel_vsad_neon

35 times faster than C.

Janne Grunau [Tue, 29 Jul 2014 11:06:24 +0100 (11:06 +0100)]
aarch64: NEON asm for missing x264_zigzag_* functions

zigzag_scan_4x4_field_neon, zigzag_sub_4x4_field_neon,
zigzag_sub_4x4ac_field_neon, zigzag_sub_4x4_frame_neon,
zigzag_sub_4x4ac_frame_neon more than 2 times faster

zigzag_scan_8x8_frame_neon, zigzag_scan_8x8_field_neon,
zigzag_sub_8x8_field_neon, zigzag_sub_8x8_frame_neon 4-5 times faster

zigzag_interleave_8x8_cavlc_neon 6 times faster

Janne Grunau [Fri, 25 Jul 2014 11:53:17 +0100 (11:53 +0100)]
aarch64: implement x264_pixel_sa8d_satd_16x16_neon

~20% faster than calling pixel_sa8d_16x16 and pixel_satd_16x16

Janne Grunau [Thu, 14 Aug 2014 22:13:27 +0100 (23:13 +0200)]
aarch64: optimize x264_predict_8x8c_dc_left_neon

25% faster than the previous version.

Henrik Gramner [Sat, 2 Aug 2014 17:26:18 +0100 (18:26 +0200)]
x86: Make AVX2 also imply FMA3

All CPUs with AVX2 support FMA3 (but not the other way around).

Anton Mitrofanov [Thu, 13 Nov 2014 20:52:00 +0100 (22:52 +0300)]
Simplify libx264 API usage example

Henrik Gramner [Fri, 21 Nov 2014 23:47:20 +0100 (23:47 +0100)]
AvxSynth: Remove a bunch of unused cruft

Anton Mitrofanov [Wed, 3 Dec 2014 20:36:12 +0100 (22:36 +0300)]
Fix bugs/typos in motion compensation and cache_load

Didn't affect output due to the incorrect values either not being used in the
code path or producing equal results compared to the correct values.

Also deduplicate hpel_ref arrays.

Anton Mitrofanov [Sun, 30 Nov 2014 21:39:28 +0100 (23:39 +0300)]
checkasm: Fix undefined behavior warnings

Henrik Gramner [Sat, 29 Nov 2014 18:47:52 +0100 (18:47 +0100)]
checkasm: Fix V210 reporting

It would previously report FAILED if any of the earlier plane_copy tests failed.

Anton Mitrofanov [Sun, 12 Oct 2014 18:01:53 +0100 (21:01 +0400)]
Safety check against malicious high bit-depth input which could cause crash

Anton Mitrofanov [Sun, 12 Oct 2014 17:45:40 +0100 (20:45 +0400)]
libx264 API usage example

Henrik Gramner [Fri, 17 Oct 2014 20:35:42 +0100 (21:35 +0200)]
x86: AVX2 high bit-depth var_16x16

40->27 cycles on Haswell.

Henrik Gramner [Wed, 8 Oct 2014 21:25:35 +0100 (22:25 +0200)]
checkasm: Serialize read_time() calls on x86

Improves the accuracy of benchmarks, especially in short functions.

To quote the Intel 64 and IA-32 Architectures Software Developer's Manual:
"The RDTSC instruction is not a serializing instruction. It does not necessarily
wait until all previous instructions have been executed before reading the counter.
Similarly, subsequent instructions may begin execution before the read operation
is performed. If software requires RDTSC to be executed only after all previous
instructions have completed locally, it can either use RDTSCP (if the processor
supports that instruction) or execute the sequence LFENCE;RDTSC."

RDTSCP would accomplish the same task, but it's only available since Nehalem.

This change makes SSE2 a requirement to run checkasm.
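The LFENCE;RDTSC sequence quoted from the Intel manual can be sketched in C with compiler intrinsics. This is an illustrative sketch only, assuming GCC or Clang with `<x86intrin.h>`; the function name `read_time_serialized` is hypothetical and this is not checkasm's actual code. On targets without x86 SSE2 the sketch simply returns 0.

```c
#include <assert.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#include <x86intrin.h>   /* GCC/Clang: provides _mm_lfence() and __rdtsc() */
#define HAVE_SERIALIZED_TSC 1
#else
#define HAVE_SERIALIZED_TSC 0
#endif

/* Hypothetical helper: read the time-stamp counter only after all earlier
 * instructions have completed locally. LFENCE is an SSE2 instruction, which
 * is why this approach needs SSE2. */
static inline uint64_t read_time_serialized(void)
{
#if HAVE_SERIALIZED_TSC
    _mm_lfence();        /* wait for previous instructions to finish */
    return __rdtsc();    /* then sample the counter */
#else
    return 0;            /* placeholder on targets without x86 SSE2 */
#endif
}
```

A benchmark would call this helper immediately before and after the function under test and subtract the two readings.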

Vittorio Giovara [Mon, 29 Sep 2014 18:51:30 +0100 (18:51 +0100)]
Support case-independent string options
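As a rough illustration of case-independent option matching, the sketch below looks up a user-supplied string in a NULL-terminated list of accepted names while ignoring letter case. The helper names are hypothetical and the name list in the usage note is only an example; x264's own parsing code may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive equality for two NUL-terminated strings. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical helper: index of arg in a NULL-terminated name list, or -1. */
static int lookup_option(const char *arg, const char *const *names)
{
    assert(arg && names);
    for (int i = 0; names[i]; i++)
        if (equal_ignore_case(arg, names[i]))
            return i;
    return -1;
}
```

With a list such as { "undef", "show", "crop", NULL }, both "show" and "Show" would resolve to the same index.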

Anton Mitrofanov [Sat, 6 Sep 2014 17:44:49 +0100 (20:44 +0400)]
Shut up gcc -Wuninitialized warnings

Anton Mitrofanov [Fri, 5 Sep 2014 16:43:52 +0100 (19:43 +0400)]
Shut up clang -Wuninitialized warning

Anton Mitrofanov [Fri, 5 Sep 2014 16:30:47 +0100 (19:30 +0400)]
Fix few clang -Wunused-* warnings

Anton Mitrofanov [Thu, 28 Aug 2014 17:13:13 +0100 (20:13 +0400)]
Fix inappropriate instruction use

Anton Mitrofanov [Thu, 28 Aug 2014 15:38:53 +0100 (18:38 +0400)]
x264asm: warn when inappropriate instruction used in function with specified cpuflags

Anton Mitrofanov [Mon, 1 Sep 2014 22:48:00 +0100 (01:48 +0400)]
Fix VBV with true VFR streams

Anton Mitrofanov [Mon, 1 Sep 2014 19:45:00 +0100 (22:45 +0400)]
Fix VBV

Anton Mitrofanov [Wed, 30 Jul 2014 01:03:32 +0200 (03:03 +0400)]
Update to the current lavf API and fix memory leak when using --seek

Henrik Gramner [Tue, 5 Aug 2014 01:42:55 +0200 (01:42 +0200)]
x86inc: Make INIT_CPUFLAGS support an arbitrary number of cpuflags

Previously there was a limit of two cpuflags.

Henrik Gramner [Tue, 5 Aug 2014 01:42:51 +0200 (01:42 +0200)]
x86: Minor pixel_ssim_end4 improvements

Reduce the number of vector registers used from 7 to 5.
Eliminate some moves in the AVX implementation.
Avoid bypass delays for transitioning between int and float domains.

Henrik Gramner [Tue, 5 Aug 2014 01:42:47 +0200 (01:42 +0200)]
x86: Faster quant_4x4x4

Also drop the MMX version instead of doing a bunch of ifdeffery to support it after this change.

Anton Mitrofanov [Sun, 10 Aug 2014 20:46:12 +0200 (22:46 +0400)]
configure: improve cc_check for clang and ICL to not ignore unknown options

Henrik Gramner [Tue, 5 Aug 2014 01:42:44 +0200 (01:42 +0200)]
checkasm: Only call x264_cpu_detect() once

Janne Grunau [Fri, 18 Jul 2014 15:49:10 +0200 (14:49 +0100)]
aarch64: deblocking NEON asm

Deblock chroma/luma are based on libav's h264 aarch64 NEON deblocking
filter which was ported by me from the existing ARM NEON asm. No
additional persons to ask for a relicense.

Janne Grunau [Fri, 18 Jul 2014 10:29:35 +0200 (09:29 +0100)]
aarch64: intra prediction NEON asm

Ported from the ARM NEON asm.

Janne Grunau [Thu, 17 Jul 2014 16:58:44 +0200 (15:58 +0100)]
aarch64: motion compensation NEON asm

Ported from the ARM NEON asm.

Janne Grunau [Wed, 16 Jul 2014 11:03:52 +0200 (10:03 +0100)]
aarch64: transform and zigzag NEON asm

Ported from the ARM NEON asm.

Janne Grunau [Tue, 15 Jul 2014 13:57:03 +0200 (12:57 +0100)]
aarch64: quantization and level-run NEON asm

Ported from the ARM NEON asm.

Janne Grunau [Wed, 19 Mar 2014 14:48:21 +0200 (13:48 +0100)]
aarch64: pixel metrics NEON asm

Ported from the ARM NEON asm.

Janne Grunau [Fri, 18 Jul 2014 17:44:57 +0200 (17:44 +0200)]
aarch64: add utility functions for asm

Janne Grunau [Wed, 19 Mar 2014 14:45:17 +0200 (13:45 +0100)]
aarch64: add armv8 and neon cpu flags and test them

Janne Grunau [Tue, 18 Mar 2014 23:10:24 +0200 (22:10 +0100)]
aarch64: initial build support

Janne Grunau [Tue, 22 Jul 2014 19:28:27 +0200 (19:28 +0200)]
checkasm: test zigzag_sub_8x8_{frame,field}

Janne Grunau [Sun, 20 Jul 2014 18:29:01 +0200 (18:29 +0200)]
arm: use long multiplication in mc_weight_w*_neon

9-19% faster on a cortex-a9.

Janne Grunau [Sun, 20 Jul 2014 18:24:57 +0200 (18:24 +0200)]
arm: do not use aligned stores in mc_weight_w4_*neon

mc_weight_w4_*neon is also used for width 2 which does not guarantee
4-byte aligned destination. Fixes crashes caused by random memory

Janne Grunau [Wed, 2 Apr 2014 16:31:28 +0200 (16:31 +0200)]
checkasm: add memory clobber to read_time inline asm

The memory clobber acts as a compiler barrier, preventing aggressive
reordering of read_time calls. gcc 4.8 reorders some of the initial
read_time calls after the second when targeting arm.
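The effect of a "memory" clobber can be sketched with an empty inline asm statement in GCC/Clang syntax. The macro and function names below are hypothetical, and the standard `clock()` call stands in for the real cycle-counter read; this is an illustration of the idea, not checkasm's read_time implementation.

```c
#include <assert.h>
#include <time.h>

/* Compiler-only barrier (GCC/Clang syntax). It emits no instructions, but
 * the "memory" clobber tells the compiler that memory may have been touched,
 * so loads and stores are not moved across this point. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Hypothetical example: sum an array between two timer reads. */
static long timed_sum(const int *v, int n, clock_t *elapsed)
{
    assert(v && elapsed);
    clock_t start = clock();
    COMPILER_BARRIER();            /* keep the loads below after the first read */
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    COMPILER_BARRIER();            /* keep the loads above before the second read */
    *elapsed = clock() - start;
    return sum;
}
```

With an opaque library call such as `clock()` the barriers are largely redundant; they matter when the timer read is itself a short inline asm statement, which is the situation the commit describes.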

Janne Grunau [Sun, 20 Jul 2014 13:32:10 +0200 (13:32 +0200)]
arm: check if the assembler supports the '.func' directive

The integrated assembler in llvm trunk (to be released as 3.5) is
otherwise capable enough to assemble the arm asm correctly.

Janne Grunau [Sun, 20 Jul 2014 13:40:28 +0200 (13:40 +0200)]
arm/ppc: use $CC as default assembler

Janne Grunau [Sun, 20 Jul 2014 13:34:27 +0200 (13:34 +0200)]
arm: move instructions after '.rept' to separate line

The gas manual states "Repeat the sequence of lines between the .rept
directive and the next .endr directive ...". GNU as seems to support
instructions on the same line as .rept anyway but the integrated
assembler in llvm trunk (to be released as 3.5 in August 2014) does not.

Janne Grunau [Sun, 20 Jul 2014 13:08:17 +0200 (13:08 +0200)]
arm: set .arch/.fpu from asm.S

Janne Grunau [Sun, 20 Jul 2014 12:55:53 +0200 (12:55 +0200)]
arm: do not append CFLAGS to ASFLAGS

Tristan Matthews [Thu, 17 Jul 2014 06:03:50 +0200 (00:03 -0400)]
filters: fix sizeof mismatch

Anton Mitrofanov [Thu, 31 Jul 2014 14:17:32 +0200 (16:17 +0400)]
Fix memory leak when using select_every filter

Tsukasa OMOTO [Sun, 20 Jul 2014 15:17:11 +0200 (22:17 +0900)]
Fix on OS X

Fiona Glaser [Wed, 9 Jul 2014 21:21:33 +0200 (12:21 -0700)]
Check pf_log is set in validate_parameters

Help remind people to call x264_param_default in case they didn't read the

Anton Mitrofanov [Wed, 9 Jul 2014 15:17:04 +0200 (17:17 +0400)]
Check malloc during frame dumping

Yusuke Nakamura [Wed, 18 Jun 2014 22:21:29 +0200 (05:21 +0900)]
mp4_lsmash: Use new I/O API instead of deprecated one.

Anton Mitrofanov [Sun, 8 Jun 2014 20:19:46 +0200 (22:19 +0400)]
Remove meaningless use of abs()

Steven Walters [Sat, 31 May 2014 16:31:16 +0200 (10:31 -0400)]
MSVS 2013 Update 2 support

The first MSVS compiler C99 compliant enough to build x264.
Use `CC=cl ./configure` to compile with it.

Diego Biurrun [Tue, 15 Apr 2014 22:54:08 +0200 (22:54 +0200)]
configure: Add -Wno-maybe-uninitialized to CFLAGS

The warnings generated by -Wmaybe-uninitialized are mostly spurious.

Diego Biurrun [Wed, 7 May 2014 13:20:43 +0200 (13:20 +0200)]
build: Replace by a shell script

This avoids a dependency on Perl to build OpenCL support.

Diego Biurrun [Tue, 15 Apr 2014 23:02:39 +0200 (23:02 +0200)]
build: Simplify phony target declaration with wildcards

Also add etags to list of phony targets.

Diego Biurrun [Wed, 7 May 2014 12:47:37 +0200 (12:47 +0200)]
configure: Drop workaround for obsolete gcc 4.2 on ARM

Diego Biurrun [Wed, 7 May 2014 21:43:15 +0200 (21:43 +0200)]
build: Add dependencies on x86inc.asm/x86util.asm for all .asm files

This is a little bit overzealous, but errs on the side of caution.
Generating full dependency information is also possible, but slightly
slows down the build as YASM cannot do it as a side effect of compilation.

Diego Biurrun [Sun, 27 Apr 2014 21:09:54 +0200 (21:09 +0200)]
Delete all SPARC optimizations

SPARC has been obsolete for a long time and makes little sense as an
H.264 encoding platform.

Also update authors file.

Diego Biurrun [Wed, 7 May 2014 12:46:42 +0200 (12:46 +0200)]
configure: Don't check for libavcore

libavcore was a never-released bad idea with a short lifespan.

Diego Biurrun [Sun, 27 Apr 2014 23:19:04 +0200 (23:19 +0200)]
build: Set all ASFLAGS from within configure

This is how all other toolchain flags are handled.

Diego Biurrun [Sun, 27 Apr 2014 23:23:49 +0200 (23:23 +0200)]
opencl: Check return value of fread()

common/opencl.c:138:10: warning: ignoring return value of 'fread', declared with attribute warn_unused_result [-Wunused-result]

Fiona Glaser [Sun, 20 Jul 2014 05:34:22 +0200 (20:34 -0700)]
Disable i8x8 in lossless

x264's implementation was slightly incorrect due to a vague spec, so some
decoders decoded video incorrectly.

Minimal impact on compression.

Thomas Mundt [Fri, 27 Jun 2014 20:12:06 +0200 (11:12 -0700)]
AVC-Intra: fix compatibility with Avid Transfermanager

Henrik Gramner [Tue, 8 Jul 2014 21:15:32 +0200 (21:15 +0200)]
x86: Fix SIGILL in high bit-depth intra_sad_x3_4x4_sse2

An SSE3 instruction was used in an SSE2 function.

Anton Mitrofanov [Wed, 9 Jul 2014 15:01:54 +0200 (17:01 +0400)]
Fix incorrect row predictor addressing

Somehow managed to not cause things to explode, but was clearly incorrect.
Might improve VBV in some cases to have this working right.

Anton Mitrofanov [Sat, 21 Jun 2014 21:52:39 +0200 (23:52 +0400)]
Fix b-pyramid MMCO remove for frame-packing==5

Tal Aloni [Wed, 18 Jun 2014 00:10:56 +0200 (15:10 -0700)]
Fix frame-packing==5 with some decoders

The spec mandates that frame-packing==5 requires the SEI on every frame that
begins a view sequence (i.e. the input frames L0-R0-L1-R1 have 4 view sequences,
but if reordered by the encoder to L0-L1-R0-R1 there are now 2 view sequences).
For simplicity, we write the SEI on every frame.

This fixes frame-packing==5 3D playback on some decoders (PlayStation 3, Sony
W8 series, possibly others).
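The view-sequence counts in the example above can be reproduced with a small sketch. It assumes, based only on the wording of this commit message, that a view sequence is a maximal run of consecutive frames belonging to the same view; the function name and the 'L'/'R' encoding are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count runs of consecutive frames that share the same view.
 * views[i] is 'L' or 'R' for the i-th frame in coded order. */
static size_t count_view_sequences(const char *views, size_t n)
{
    assert(views != NULL || n == 0);
    size_t runs = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || views[i] != views[i - 1])
            runs++;                /* a new run starts at every view change */
    return runs;
}
```

Under that reading, "LRLR" gives 4 sequences and "LLRR" gives 2, matching the numbers in the commit message. Writing the SEI on every frame avoids having to track these boundaries at all.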

Anton Mitrofanov [Thu, 22 May 2014 11:27:00 +0200 (13:27 +0400)]
Fix pixel_ssim_end4 asm function for x86_64 systems

James Almer [Wed, 9 Apr 2014 08:33:06 +0200 (03:33 -0300)]
x86: XOP pixel_sad_{x3, x4} high bit-depth

James Almer [Wed, 9 Apr 2014 08:33:05 +0200 (03:33 -0300)]
x86: XOP pixel_ssd_nv12_core

James Almer [Wed, 9 Apr 2014 08:33:04 +0200 (03:33 -0300)]
x86util: XOP optimized HADDD

James Almer [Wed, 9 Apr 2014 08:33:03 +0200 (03:33 -0300)]
x86: add missing initialization for high bit-depth sa8d_satd

James Almer [Sun, 6 Apr 2014 04:46:31 +0200 (23:46 -0300)]
x86: add missing initializations for high bit-depth variance

Janne Grunau [Tue, 1 Apr 2014 22:11:45 +0200 (22:11 +0200)]
arm: use the weight_fn_t typedef for mc weight function arrays

Janne Grunau [Tue, 1 Apr 2014 22:11:44 +0200 (22:11 +0200)]
arm: correct x264_mc_chroma_neon function declaration

Janne Grunau [Tue, 1 Apr 2014 22:11:43 +0200 (22:11 +0200)]
arm: do not export every asm function

Based on Libav's libavutil/arm/asm.S. Also prevents having the same
label twice for every function on systems not defining EXTERN_ASM.
Clang's integrated assembler does not like it.

Janne Grunau [Tue, 1 Apr 2014 22:11:42 +0200 (22:11 +0200)]
arm: move all .macro/.endm to column 0

William Grant [Sun, 23 Mar 2014 18:21:52 +0200 (09:21 -0700)]
aarch64: require PIC in shared mode

Janne Grunau [Sun, 16 Mar 2014 18:21:58 +0200 (17:21 +0100)]
arm: x264_coeff_last8_arm

checkasm --bench on a cortex-a9:
coeff_last8_c: 173
coeff_last8_armv6: 151

60 instead of 73 cycles in ~130k runs on the same cpu while encoding.
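For context, a plain-C reference for this kind of function might look like the sketch below. It assumes that coeff_last8 returns the index of the last non-zero coefficient among 8 entries (or -1 if all are zero) and that coefficients are 16-bit, as in 8-bit-depth builds; the name `ref_coeff_last8` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the last non-zero coefficient among 8 entries,
 * or -1 if every coefficient is zero. */
static int ref_coeff_last8(const int16_t *coef)
{
    assert(coef);
    int last = 7;
    while (last >= 0 && coef[last] == 0)
        last--;                    /* scan backwards past trailing zeros */
    return last;
}
```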

Janne Grunau [Sat, 15 Mar 2014 21:09:18 +0200 (20:09 +0100)]
arm: x264_store_interleave_chroma_neon

store_interleave_chroma_c: 4036
store_interleave_chroma_neon: 1043

Janne Grunau [Sat, 15 Mar 2014 20:55:50 +0200 (19:55 +0100)]
arm: x264_plane_copy_interleave_neon

plane_copy_interleave_c: 40285
plane_copy_interleave_neon: 10137
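The interleave and deinterleave functions benchmarked in these entries convert between separate U and V chroma samples and a single UVUV... sequence. Below is a minimal single-row sketch in C, assuming 8-bit samples with U first; the function names are hypothetical, and the real functions also handle strides and multiple rows.

```c
#include <assert.h>
#include <stdint.h>

/* Merge n U samples and n V samples into dst as U0 V0 U1 V1 ... */
static void ref_interleave(uint8_t *dst, const uint8_t *u,
                           const uint8_t *v, int n)
{
    assert(dst && u && v);
    for (int i = 0; i < n; i++) {
        dst[2 * i]     = u[i];
        dst[2 * i + 1] = v[i];
    }
}

/* Split n interleaved UV pairs from src back into separate arrays. */
static void ref_deinterleave(uint8_t *u, uint8_t *v,
                             const uint8_t *src, int n)
{
    assert(u && v && src);
    for (int i = 0; i < n; i++) {
        u[i] = src[2 * i];
        v[i] = src[2 * i + 1];
    }
}
```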

Janne Grunau [Sat, 15 Mar 2014 20:21:12 +0200 (19:21 +0100)]
arm: x264_plane_copy_deinterleave_rgb_neon

plane_copy_deinterleave_rgb_c: 31543
plane_copy_deinterleave_rgb_neon: 8312

Janne Grunau [Sat, 15 Mar 2014 19:22:49 +0200 (18:22 +0100)]
arm: load_deinterleave_chroma_f{dec,enc}_neon

load_deinterleave_chroma_fdec_c: 4055
load_deinterleave_chroma_fdec_neon: 995
load_deinterleave_chroma_fenc_c: 4071
load_deinterleave_chroma_fenc_neon: 992

Janne Grunau [Sat, 15 Mar 2014 18:22:08 +0200 (17:22 +0100)]
arm: x264_plane_copy_deinterleave_neon

plane_copy_deinterleave_c: 42988
plane_copy_deinterleave_neon: 10184

Janne Grunau [Sat, 15 Mar 2014 14:29:41 +0200 (13:29 +0100)]
arm: implement deblock_strength_neon

Based on deblock_strength_avx.

checkasm --bench on a cortex-a9:
deblock_strength_c: 14611
deblock_strength_neon: 1848

Janne Grunau [Sat, 15 Mar 2014 11:51:11 +0200 (10:51 +0100)]
arm: add missing macro instantiation for x264_pixel_avg_4x16_neon

checkasm --bench on a cortex-a9:
avg_4x16_c: 8910
avg_4x16_neon: 2091

Janne Grunau [Thu, 13 Mar 2014 02:02:13 +0200 (01:02 +0100)]
arm: implement x264_predict_4x4_v_armv6

Alone probably not worth it but allows use of predict_4x4_dc|h_armv6
in intra_sad|satd_x3_4x4_neon.

Roland Stigge [Sun, 23 Mar 2014 18:29:37 +0200 (09:29 -0700)]
ppc: fix build on certain PowerPC variants without Altivec

Anton Mitrofanov [Mon, 21 Apr 2014 22:58:24 +0200 (00:58 +0400)]
Only add strip option '-s' for linker flags

Fixes some build warnings with clang.

Tsukasa OMOTO [Sat, 15 Mar 2014 09:53:53 +0200 (16:53 +0900)]
configure: remove an unnecessary option from CFLAGS on OS X

Fixes Clang 3.4 compilation on OS X.

Jason Garrett-Glaser [Sun, 23 Feb 2014 19:36:55 +0100 (10:36 -0800)]
Macroblock tree overhaul/optimization

Move the second core part of macroblock tree into an assembly function;
SIMD-optimize roughly half of it (for x86). Roughly ~25-65% faster mbtree,
depending on content.

Slightly change how mbtree handles the tradeoff between range and precision
for propagation.

Overall a slight (but mostly negligible) effect on SSIM and ~2% faster.

Janne Grunau [Thu, 13 Mar 2014 00:05:48 +0100 (00:05 +0100)]
arm: use available neon functions for intra_sa8d/sad/satd_x3

4% faster on main/medium, 15% faster on baseline/superfast on a cortex-a9.

Janne Grunau [Wed, 12 Mar 2014 14:35:31 +0100 (14:35 +0100)]
arm: implement x264_pixel_var2_8x16_neon

checkasm --bench on a cortex-a9:
var2_8x16_c: 5677
var2_8x16_neon: 1421

Janne Grunau [Wed, 12 Mar 2014 13:16:00 +0100 (13:16 +0100)]
arm: implement x264_pixel_var_8x16_neon

checkasm --bench on a cortex-a9:
var_8x16_c: 4306
var_8x16_neon: 791

Henrik Gramner [Sun, 23 Feb 2014 15:33:48 +0100 (15:33 +0100)]
x86: SSE2 and SSSE3 plane_copy_deinterleave_rgb

About 5.6x faster than C on Haswell.

Henrik Gramner [Sun, 16 Feb 2014 21:24:54 +0100 (21:24 +0100)]
x86: Minor mbtree_propagate_cost improvements

Reduce the number of registers used from 7 to 6.
Reduce the number of vector registers used by the AVX2 implementation from 8 to 7.
Multiply fps_factor by 1/256 once per frame instead of once per macroblock row.
Use mova instead of movu for dst since it's guaranteed to be aligned.
Some cosmetics.

Henrik Gramner [Sun, 9 Feb 2014 23:58:04 +0100 (23:58 +0100)]
x86inc: Support arbitrary stack alignments

If the stack is known to be at least 32-byte aligned we can safely store ymm
registers on the stack without doing manual alignment.

Change ALLOC_STACK to always align the stack before allocating stack space for
consistency. Previously alignment would occur either before or after allocating
stack space depending on whether manual alignment was required or not.

Anton Mitrofanov [Fri, 14 Feb 2014 12:53:58 +0100 (15:53 +0400)]
x86inc: warn if XOP integer FMA instruction emulation is impossible

Emulation requires a temporary register if arguments 1 and 4 are the same; this
doesn't obey the semantics of the original instruction, so we can't emulate
that in x86inc.

ffmpeg has an x86util emulation for that case; I'll add it if x264's asm ever
needs it.

Also add pmacsdql emulation.

Loren Merritt [Sat, 1 Mar 2014 03:57:56 +0100 (02:57 +0000)]
x86inc: free up variable name "n" in global namespace

Henrik Gramner [Wed, 22 Jan 2014 19:09:12 +0100 (19:09 +0100)]
x86: Pass -Worphan-labels to yasm

Makes it easier to detect typos.

Steve Lhomme [Sun, 16 Feb 2014 13:15:09 +0100 (13:15 +0100)]
Write 3D metadata when outputting Matroska

For when --frame-packing is set.

Anton Mitrofanov [Sun, 23 Feb 2014 13:56:03 +0100 (16:56 +0400)]
Don't set chroma_loc_info_present_flag for non-4:2:0

The H.264 spec says it shouldn't be set in these cases.

Jason Garrett-Glaser [Mon, 10 Mar 2014 16:42:50 +0100 (08:42 -0700)]
x264.h: fix documentation

The full details of the return values of encoder_encode and encoder_headers
were mistakenly removed a while ago; re-add them.

Anton Mitrofanov [Sun, 23 Feb 2014 12:52:57 +0100 (15:52 +0400)]
Fix pointer cast warning for 64-bit builds

Anton Mitrofanov [Mon, 10 Mar 2014 13:48:02 +0100 (16:48 +0400)]
mbaff: fix mb_field_decoding_flag tracking and simplify allow skip check

Fixes an issue with too many forced non-skips in mbaff+cavlc, as well as
non-deterministic output with mbaff+cavlc+sliced-threads.

Anton Mitrofanov [Mon, 10 Mar 2014 00:22:57 +0100 (03:22 +0400)]
Fix memory overwrite in x264_deblock_h_chroma_mbaff_sse2

Fixes possible corruption with MBAFF+sliced threads.

Jason Garrett-Glaser [Sun, 2 Mar 2014 19:09:01 +0100 (10:09 -0800)]
Fix corruption with CAVLC overflow handling in MBAFF+main profile

Probably a regression in 83561e.

Anton Mitrofanov [Mon, 10 Mar 2014 18:17:19 +0100 (21:17 +0400)]
Fix checkasm --bench output when nop_cycles is too large

Anton Mitrofanov [Wed, 22 Jan 2014 09:54:49 +0100 (12:54 +0400)]
Really fix quantization factor allocation

Actually allocate less (instead of just initialize less) and fix comments.

Yu Xiaolei [Sun, 23 Feb 2014 13:12:51 +0100 (04:12 -0800)]
Fix build with Android NDK

Android NDK does not expose sched_getaffinity.

Loren Merritt [Thu, 16 Jan 2014 22:34:46 +0100 (13:34 -0800)]
x86inc: speed up compilation with yasm

Work around yasm's inefficiency with handling large numbers of variables
in the global scope.

Kieran Kunhya [Sat, 11 Jan 2014 00:27:33 +0100 (23:27 +0000)]
Add support for AVC-Intra Class 200

James Weaver [Tue, 7 Jan 2014 11:31:58 +0100 (10:31 +0000)]
v210 input support

Assembly based on code by Henrik Gramner and Loren Merritt.

Jason Garrett-Glaser [Tue, 21 Jan 2014 22:39:33 +0100 (13:39 -0800)]
Fix quantization factor allocation

We don't need to wastefully allocate quant tables above QP_MAX_SPEC; they're
never used.

Henrik Gramner [Wed, 8 Jan 2014 01:06:56 +0100 (01:06 +0100)]
Avoid some unnecessary memory loads in macroblock_encode

Henrik Gramner [Sun, 5 Jan 2014 15:25:05 +0100 (15:25 +0100)]
Bump dates to 2014

Also update AUTHORS file and my e-mail address in the headers of various files.

Henrik Gramner [Mon, 6 Jan 2014 00:18:31 +0100 (00:18 +0100)]
Remove tools/xyuv.c

It's an old stand-alone application that isn't relevant to x264.

Anton Mitrofanov [Wed, 6 Nov 2013 23:37:23 +0100 (02:37 +0400)]
Use 8x16c wrappers with x86 asm functions for 4:2:2 with high bit depth

Henrik Gramner [Fri, 20 Dec 2013 22:44:28 +0100 (22:44 +0100)]
CLI: Avoid redundant 16-bit upconversions in piped raw input

It's not possible to seek in pipes, so if we want to skip frames we have to read and
discard unused ones. It's pointless to do bit-depth upconversions in those frames.
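The read-and-discard approach for unseekable input can be sketched as follows. This is an illustrative sketch, not the x264 CLI's actual code: the function name `skip_frames` is hypothetical, and it assumes fixed-size raw frames.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Skip up to `count` frames of `frame_size` bytes each by reading them into
 * a scratch buffer and throwing the data away. Works on pipes, where fseek()
 * is not available. Returns the number of whole frames actually skipped. */
static int skip_frames(FILE *in, size_t frame_size, int count)
{
    assert(in && frame_size > 0);
    unsigned char *scratch = malloc(frame_size);
    if (!scratch)
        return 0;
    int skipped = 0;
    while (skipped < count &&
           fread(scratch, 1, frame_size, in) == frame_size)
        skipped++;                 /* raw bytes only; no bit-depth conversion */
    free(scratch);
    return skipped;
}
```

Because the skipped data is never used, there is no reason to run any per-sample conversion on it, which is the saving this commit describes.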

Anton Mitrofanov [Fri, 3 Jan 2014 17:06:06 +0100 (20:06 +0400)]
Fix input support from named pipes in Windows

Steve Clark [Wed, 20 Nov 2013 18:40:23 +0100 (21:40 +0400)]
Fix ARM asm compilation with Apple assembler

Anton Mitrofanov [Wed, 13 Nov 2013 16:24:48 +0100 (19:24 +0400)]
Fix uninitialized variable

Caused if the timebase is not specified in stats file. Found by Clang.

Anton Mitrofanov [Sun, 27 Oct 2013 16:27:23 +0100 (19:27 +0400)]
Remove --visualize option.

It probably wasn't used or maintained for the last few years.

Anton Mitrofanov [Tue, 15 Oct 2013 09:32:25 +0100 (12:32 +0400)]
Add L-SMASH support as preferable alternative for MP4-muxing

Kieran Kunhya [Sat, 21 Sep 2013 19:16:12 +0100 (19:16 +0100)]
Add AVC-Intra 1080p50/60 Class 100 parameters

Also add some compatibility fixes.

Jason Garrett-Glaser [Mon, 9 Sep 2013 20:37:59 +0100 (12:37 -0700)]
Add --filler option

Allows generation of hard-CBR streams without using NAL HRD.
Useful if you want to be able to reconfigure the bitrate (which you can't do
with NAL HRD on).

Anton Mitrofanov [Sun, 27 Oct 2013 12:22:51 +0100 (15:22 +0400)]
Make x264_encoder_reconfig more threadsafe

Do the reconfig when the next frame's encode begins.
Fixes some rare crashes with frame-threading and encoder_reconfig.

Jason Garrett-Glaser [Fri, 25 Oct 2013 01:19:00 +0100 (17:19 -0700)]
chroma-me: take shortcut in BI analysis

~100 cycles faster with subme>=9

Jason Garrett-Glaser [Thu, 24 Oct 2013 22:44:43 +0100 (14:44 -0700)]
CRF-max: don't warn if VBV underflow occurs

Only warn if underflow occurs for reasons other than CRF-max, as CRF-max
implies that VBV underflow is desired by the user.

Henrik Gramner [Fri, 18 Oct 2013 21:43:36 +0100 (22:43 +0200)]
x86inc: Make ym# behave the same way as xm#

This makes more sense for future implementations of templates with zmm registers.

Henrik Gramner [Fri, 18 Oct 2013 21:21:38 +0100 (22:21 +0200)]
Use calloc instead of malloc + memset
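
As a generic illustration of this kind of change (the function names are hypothetical and not taken from x264), the two patterns look like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear in a second step. */
static int *alloc_counts_old(size_t n)
{
    int *p = malloc(n * sizeof(*p));   /* n * sizeof can overflow unnoticed */
    if (p)
        memset(p, 0, n * sizeof(*p));
    return p;
}

/* After: calloc returns zero-filled memory in one call. Most modern C
 * libraries also fail cleanly if n * size would overflow. */
static int *alloc_counts_new(size_t n)
{
    return calloc(n, sizeof(int));
}
```

Both return a zeroed array that the caller releases with `free`; the second form is shorter and leaves the size multiplication to the library.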

Henrik Gramner [Thu, 10 Oct 2013 15:54:12 +0100 (16:54 +0200)]
Replace gf_malloc with regular malloc in mp4 muxer

It was used as a workaround for a bug that only existed in the GPAC repository
for a few weeks back in 2010. There's no reason to keep it anymore.

Anton Mitrofanov [Tue, 8 Oct 2013 20:20:40 +0100 (23:20 +0400)]
Update to current libav/ffmpeg API

Rafaël Carré [Fri, 25 Oct 2013 15:12:24 +0100 (07:12 -0700)]
change to use /bin/sh

Sean McGovern [Wed, 4 Sep 2013 22:15:00 +0100 (14:15 -0700)]
configure: don't generate a git version number if .git isn't present

Martin Storsjo [Tue, 3 Sep 2013 22:56:18 +0100 (14:56 -0700)]
configure: include dependency libs in the Libs pkg-config

If only a static library is built, the user of the library that just
tries to link to the lib using the flags provided by pkg-config
might not know that only a static lib exists and that he'd have to
pass --static to pkg-config to get the internal dependencies to
be able to link the library.

For a shared build, the internal dependencies are kept in Libs.private
as before.

This matches how libav's pkg-config files are generated.

Anton Mitrofanov [Thu, 17 Oct 2013 21:38:06 +0100 (00:38 +0400)]
Fix compilation in case of HAVE_LOG2F check fails spuriously

Anton Mitrofanov [Sat, 12 Oct 2013 09:01:57 +0100 (12:01 +0400)]
Fix compilation of shared library for Windows with original MinGW toolchain

Anton Mitrofanov [Tue, 8 Oct 2013 20:32:37 +0100 (23:32 +0400)]
Fix possible crashes in resize and crop filters with high bitdepth input

Tim Mooney [Tue, 3 Sep 2013 21:43:50 +0100 (13:43 -0700)]
Fix INSTALL in configure for Solaris systems

Henrik Gramner [Tue, 27 Aug 2013 23:50:31 +0100 (00:50 +0200)]
Workaround for FFMS indexing bug

If FFMS_ReadIndex is used with an empty index file it gets stuck in an infinite loop instead of returning NULL
like it's supposed to do on failure. Explicitly check if the file is empty before calling it as a workaround.
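
A minimal sketch of such an emptiness check, assuming the index file is already open as a stdio stream. `stream_is_empty` is a hypothetical helper, not the code from this commit; the idea is simply to test the length first and only hand the file to FFMS_ReadIndex when it has content.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the stream has zero length, 0 if it has
 * content, -1 if the length cannot be determined. Rewinds the stream. */
static int stream_is_empty(FILE *f)
{
    int result = -1;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0)
            result = (size == 0);
    }
    rewind(f);
    return result;
}
```

A caller would treat a result of 1 the same way as a failed index read and fall back to re-indexing the source.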

Anton Mitrofanov [Mon, 26 Aug 2013 19:20:31 +0200 (21:20 +0400)]
Fix masked access violation in KERNEL32

Caused crashes under gdb in Windows and might cause other unknown problems.

Hiroki Taniura [Sat, 24 Aug 2013 18:18:57 +0200 (01:18 +0900)]
Fix GPAC support on Windows

Henrik Gramner [Sun, 11 Aug 2013 19:50:42 +0200 (19:50 +0200)]
Windows Unicode support

Windows, unlike most other operating systems, uses UTF-16 for Unicode strings while x264 is designed for UTF-8.

This patch does the following in order to handle things like Unicode filenames:
* Keep strings internally as UTF-8.
* Retrieve the CLI command line as UTF-16 and convert it to UTF-8.
* Always use Unicode versions of Windows API functions and convert strings to UTF-16 when calling them.
* Attempt to use legacy 8.3 short filenames for external libraries without Unicode support.
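
The UTF-16 to UTF-8 step in the list above can be illustrated with a portable converter. This is a sketch only: on Windows the system routine WideCharToMultiByte with CP_UTF8 does this job, and `utf16_to_utf8` below is a hypothetical stand-in written so that the example compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert a NUL-terminated UTF-16 string to UTF-8.
 * Returns the number of bytes written (excluding the terminator),
 * or -1 on invalid input or if `dst` is too small. */
static int utf16_to_utf8(const uint16_t *src, char *dst, size_t dst_size)
{
    size_t o = 0;
    while (*src) {
        uint32_t cp = *src++;
        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            uint32_t lo = *src;
            if (lo < 0xDC00 || lo > 0xDFFF) return -1; /* unpaired */
            src++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                                 /* stray low surrogate */
        }
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= dst_size) return -1;           /* keep room for NUL */
        if (need == 1) {
            dst[o++] = (char)cp;
        } else if (need == 2) {
            dst[o++] = (char)(0xC0 | (cp >> 6));
            dst[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[o++] = (char)(0xE0 | (cp >> 12));
            dst[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[o++] = (char)(0xF0 | (cp >> 18));
            dst[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (dst_size == 0) return -1;
    dst[o] = '\0';
    return (int)o;
}
```

Keeping every internal string in UTF-8 means a conversion like this is only needed at the boundary, when reading the command line or calling a wide-character API.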

Kieran Kunhya [Sat, 20 Jul 2013 19:47:59 +0200 (18:47 +0100)]
AVC-Intra support

This format has been reverse engineered and x264's output has almost exactly
the same bitstream as Panasonic cameras and encoders produce. It therefore does
not comply with SMPTE RP2027 since Panasonic themselves do not comply with
their own specification. It has been tested in Avid, Premiere, Edius and

Parts of this patch were written by Jason Garrett-Glaser and some reverse
engineering was done by Joseph Artsimovich.

Henrik Gramner [Mon, 8 Jul 2013 21:06:42 +0200 (12:06 -0700)]
Transparent hugepage support

Combine frame and mb data mallocs into a single large malloc.
Additionally, on Linux systems with hugepage support, ask for hugepages on
large mallocs.

This gives a small performance improvement (~0.2-0.9%) on systems without
hugepage support, as well as a small memory footprint reduction.

On recent Linux kernels with hugepage support enabled (set to madvise or
always), it improves performance up to 4% at the cost of about 7-12% more
memory usage on typical settings.

It may help even more on Haswell and other recent CPUs with improved 2MB page
support in hardware.
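
Asking for hugepages on a large allocation can be sketched as below. The names, the 2 MB page size and the threshold are assumptions made for the example and should not be read as x264's actual values; `madvise` with `MADV_HUGEPAGE` is the Linux interface for this hint, and it is only advisory.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__linux__)
#include <sys/mman.h>
#endif

#define HUGE_PAGE_SIZE      ((size_t)2 * 1024 * 1024)  /* assumed 2 MB pages */
#define HUGE_PAGE_THRESHOLD (HUGE_PAGE_SIZE * 7 / 8)   /* hypothetical cutoff */

/* Hypothetical allocator: large requests are aligned to the hugepage size
 * and flagged with MADV_HUGEPAGE where available. Release with free(). */
static void *alloc_maybe_huge(size_t size)
{
    void *p = NULL;
    if (size >= HUGE_PAGE_THRESHOLD) {
        /* Round up so the hint covers whole 2 MB pages. */
        size_t padded = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
        if (posix_memalign(&p, HUGE_PAGE_SIZE, padded) != 0)
            return NULL;
#if defined(__linux__) && defined(MADV_HUGEPAGE)
        madvise(p, padded, MADV_HUGEPAGE);  /* advisory; failure is harmless */
#endif
        return p;
    }
    return malloc(size);
}
```

The rounding and alignment are one plausible source of the extra memory use mentioned above: a request slightly over a page boundary is padded to the next 2 MB multiple.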

x86: SSSE3 implementation of pixel_sad_x3 and pixel_sad_x4

x86: Faster AVX2 pixel_sad_x3 and pixel_sad_x4

x86: Remove X264_CPU_SSE_MISALIGN functions

Prevents a crash if the misaligned exception mask bit is cleared for some reason.

Misaligned SSE functions are only used on AMD Phenom CPUs and the benefit is minuscule.
They also require modifying the MXCSR control register and by removing those functions
we can get rid of that complexity altogether.

VEX-encoded instructions also support unaligned memory operands. I tried adding AVX
implementations of all removed functions but there were no performance improvements on
Ivy Bridge. pixel_sad_x3 and pixel_sad_x4 had significant code size reductions though
so I kept them and added some minor cosmetic fixes and tweaks.

Tweak i16x16-delta-quant-avoidance code

Don't omit the delta quant if it'd raise the quantizer to do so; this fixes
a rare flickering issue caused by deblocking.

x86: faster AVX2 iDCT, AVX deblock_luma_h, deblock_luma_h_intra

Add new color primaries, transfer characteristics, matrix coefficients

Add "--stitchable" option for segmented encoding

Stops x264 from attempting to optimize global stream headers, ensuring that
different segments of a video will have identical headers when used with
identical encoding settings.

Interface: if vbv-maxrate < bitrate, set bitrate = vbv-maxrate

This probably makes more sense to the user than setting vbv-maxrate = bitrate,
as before.
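
The new rule amounts to a one-line clamp. The struct and field names below are illustrative only and are not x264's parameter names:

```c
#include <assert.h>

/* Hypothetical settings struct; field names are illustrative, not x264's. */
typedef struct {
    int bitrate;      /* target average bitrate, kbit/s */
    int vbv_maxrate;  /* VBV peak rate, kbit/s; 0 means VBV is off */
} rate_settings;

static void reconcile_rates(rate_settings *s)
{
    /* New behaviour: lower the target instead of raising the VBV cap. */
    if (s->vbv_maxrate > 0 && s->vbv_maxrate < s->bitrate)
        s->bitrate = s->vbv_maxrate;
}
```

For example, a target of 5000 with a cap of 3000 ends up as target 3000, cap 3000, so the cap the user asked for is respected.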

OpenCL cosmetics

Fix possible crash when writing very large filler NALUs

Bitstream-reallocation function didn't handle the case of filler.

Fix build with PIC on some systems

Fix potential misalignment crash in AVX2 denoise_dct

Fix building with compilers without inline asm support

Also fix crash in high bit depth builds compiled with unaligned stack.

Fix compilation with OpenCL on MacOS X

Also fix crash in the case of OpenCL error during encoding.

OpenCL support improvement/refactoring

Autoload the OpenCL library so that it's not required to run an OpenCL-enabled
build of x264.

Update X264_BUILD, which should have been changed with the first patch.

x86: shave a few instructions off AVX deblock

x86: AVX2 dequant_4x4_dc

x86: AVX2 high bit-depth dequant

x86-64: 64-bit variant of AVX2 hpel_filter

~5% faster than 32-bit.

x86: AVX2 high bit-depth denoise_dct

28->15 cycles

Also reorder instructions to use fewer registers, 3 cycles faster on Ivy Bridge with 64-bit Windows.

x86: AVX2 high bit-depth quant

quant_4x4: 13->6 cycles
quant_4x4_dc: 14->8 cycles
quant_8x8: 47->24 cycles
quant_4x4x4: 48->25 cycles

x86: AVX2 add16x16_idct_dc

27 -> 19 cycles

x86: faster AVX2 quant_4x4x4

10->9 cycles

x86: AVX2 intra_sad_x3_8x8c

30->22 cycles

x86: AVX2 high bit-depth intra_sad_x3_8x8

43->24 cycles

x86: AVX2 deblock strength

30->18 cycles

x86: Faster high bit-depth intra_sad_x3_4x4

20->16 cycles on Ivy Bridge

x86: faster SSSE3 hpel

~7% faster using the pmulhrsw trick from mc_chroma.

x86-64: faster SSSE3 trellis

~2% faster trellis.

x86: 32-byte align the stack if possible

Avoids the need for manual 32 byte array alignment on compilers that support

x86inc: Utilize the shadow space on 64-bit Windows

Store XMM6 and XMM7 in the shadow space in functions that clobber them.
This way we don't have to adjust the stack pointer as often,
reducing the number of instructions as well as code size.

x86: Don't use explicitly aligned versions of SAD on AVX CPUs

On modern CPUs movdqu isn't slower than movdqa when used on aligned data and using the same code in both cases saves cache.

This was already done for the high bit-depth AVX2 implementation but the aligned version still exists as dead code so remove that.

x86: Add missing initializations for high bit-depth sad_aligned

x86: add Jaguar CPU detection

x86inc: Remove .rodata kludges

The Mach-O bug was fixed in yasm 0.8.0 and we don't support versions that old.

a.out was superseded by ELF on sane systems a few decades ago.

checkasm: Use 64-bit cycle counters

Prevents overflows that can occur in some cases.
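
Why the counter width matters can be shown with a small sum. How checkasm actually accumulates its timings is not described here, so the two functions below are purely hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Accumulating into 32 bits silently wraps once the total passes 2^32. */
static uint32_t total_cycles32(const uint32_t *samples, int n)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += samples[i];
    return total;
}

/* The same sum in 64 bits keeps the full value. */
static uint64_t total_cycles64(const uint32_t *samples, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
        total += samples[i];
    return total;
}
```

With two samples of 0xC0000000 cycles each, the 32-bit total wraps to 0x80000000 while the 64-bit total is 0x180000000.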

checkasm: Fix stack alignment bug

Fix invalid memcpy in sliced-threads

Likely didn't actually break in practice, but memcpy with src==dst
is incorrect.
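
The C standard leaves memcpy undefined when source and destination overlap, which includes the case where they are the same pointer. A guard of the following shape avoids it; `copy_if_distinct` is a hypothetical name, and the actual fix in x264 may look different:

```c
#include <assert.h>
#include <string.h>

/* Skip the copy when source and destination are the same object. */
static void copy_if_distinct(void *dst, const void *src, size_t n)
{
    if (dst != src)
        memcpy(dst, src, n);
}
```

Using memmove instead would also be valid, at the cost of doing a pointless copy in the src==dst case.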

Fix two bugs in slice-min-mbs and slices-max

Slices-max broke slice-max-size when slice-max wasn't used.
Slice-min-mbs broke in rare cases near the end of a threadslice.

x86: SSSE3 LUT-based faster coeff_level_run

~2x faster coeff_level_run.
Faster CAVLC encoding: {1%,2%,7%} overall with {superfast,medium,slower}.
Uses the same pshufb LUT abuse trick as in the previous ads_mvs patch.

x86-64: BMI2 cabac_residual functions

x86: SSSE3 ads_mvs

~55% faster ads in benchasm, ~15-30% in real encoding.
~4% faster "placebo" preset overall.

x86: AVX2 pixel_ssd_nv12_core

x86: AVX2 high bit-depth pixel_ssd

x86: AVX2 high bit-depth pixel_sad_x3/pixel_sad_x4

Also reduce the number of xmm registers used by sse2/ssse3 pixel_sad_x3.

x86: AVX2 high bit-depth vsad

x86: AVX2 high bit-depth pixel_sad

Also use loops instead of duplicating code; reduces code size by ~10kB with
negligible effect on performance.

x86: AVX2 high_bit_depth pixel_avg2, get_ref, mc_copy_w16, mc_luma

Also reduce the number of xmm registers used by mc_copy_* to avoid
saving and restoring xmm6 and xmm7 on 64-bit Windows.

x86: AVX2 nal_escape

Also rewrite the entire function to be faster and drop the AVX version which is no longer useful.

x86: AVX memzero_aligned

x86: AVX2 predict_16x16_dc

x86: AVX2 predict_8x8c_p/predict_8x16c_p

x86: AVX2 predict_16x16_p

Also fix the AVX implementation to correctly use the SSSE3 inline asm
instead of SSE2.

x86: AVX high bit-depth predict_16x16_v

Also restructure some code to reduce code size of various functions,
especially in high bit-depth.

x86: AVX2 high bit-depth predict_4x4_h

x86: AVX2 high bit-depth predict_16x16_h

x86: AVX2 high bit-depth predict_8x8c_h/predict_8x16c_h

x86util: Support ymm registers in HADD macros

x86: more AVX2 framework, AVX2 functions, plus some existing asm tweaks

AVX2 functions:
zigzag interleave

x86inc: create xm# and ym#, analogous to m#

For when we want to mix simd sizes within one function.

x86inc: fix AVX emulation of cmp(p|s)(s|d)

x86-64: cabac_block_residual assembly

RDO: ~20% faster than C
Bitstream: ~50% faster than C
1-2% faster overall, highest on preset superfast/fast/medium.

OpenCL lookahead

OpenCL support is compiled in by default, but must be enabled at runtime by an
--opencl command line flag. Compiling OpenCL support requires perl. To avoid
the perl requirement use: configure --disable-opencl.

When enabled, the lookahead thread is mostly off-loaded to an OpenCL capable GPU
device. Lowres intra cost prediction, lowres motion search (including subpel)
and bidir cost predictions are all done on the GPU. MB-tree and final slice
decisions are still done by the CPU. Presets which do not use a threaded
lookahead will not use OpenCL at all (superfast, ultrafast).

Because of data dependencies, the GPU must use an iterative motion search which
performs more total work than the CPU would do, so this is not work efficient
or power efficient. But if there are GPU cycles to spare, it can often
speed up the encode. Output quality when OpenCL lookahead is enabled is often
very slightly worse than with the CPU lookahead (because of the same data

x264 must compile its OpenCL kernels for your device before running them, and in
order to avoid doing this every run it caches the compiled kernel binary in a
file named x264_lookahead.clbin (--opencl-clbin FNAME to override). The cache
file will be ignored if the device, driver, or OpenCL source are changed.

x264 will use the first GPU device which supports the cl_image
features required by its kernels. Most modern discrete GPUs and all AMD
integrated GPUs will work. Intel integrated GPUs (up to IvyBridge) do not
support those necessary features. Use --opencl-device N to specify a number of
capable GPUs to skip during device detection.

Switchable graphics environments (e.g. AMD Enduro) are currently not supported,
as some have bugs in their OpenCL drivers that cause output to be silently

Developed by MulticoreWare with support from AMD and Telestream.

weightp: improve scale/offset search, chroma

Rescale the scale factor if the offset clips. This makes weightp more effective
in fades to/from white (and any other situation that requires big offsets).

Search more than 1 scale factor and more than 1 offset, depending on --subme.

Try to find the optimal chroma denominator instead of hardcoding it.

Overall improvement: a few percent in fade-heavy clips, such as a sample from
Avatar: TLA.

Add slices-max feature

The H.264 spec technically has limits on the number of slices per frame. x264
normally ignores this, since most use-cases that require large numbers of
slices prefer it that way. However, certain decoders may break with extremely large
numbers of slices, as can occur with some slice-max-size/mbs settings.

When set, x264 will refuse to create any slices beyond the maximum number,
even if slice-max-size/mbs requires otherwise.

Add slice-min-mbs feature

Works in conjunction with slice-max-mbs and/or slice-max-size to avoid overly
small slices.
Useful with certain decoders that barf on extremely small slices.

If slice-min-mbs would be violated as a result of slice-max-size, x264 will
exceed slice-max-size and print a warning.

Disable mbtree asm with cpu-independent option

Results vary between versions because of different rounding results.

Show "avs: no" with --disable-avs option instead of empty string

lavf input: don't use deprecated AVStream fields

Fixes building against newer libavcodecs from the Libav project.

Fix y4m input with C420paldv colorspace

x86: correctly check stack alignment for Atom hadamard_ac

Regression in r2265 (only affected compilers with broken stack alignment,
like ICL on win32).

x86inc: fix some corner cases of SWAP

SWAP with >=3 named (rather than numbered) args
PERMUTE followed by SWAP with 2 named args
used to produce the wrong permutation

Fix array overreads that caused miscompilation in gcc 4.8

Fix undefined behavior in x264_ratecontrol_mb

ARM: Fix bug in x264_quant_4x4x4_neon

Regression in r2273.

ARM: update NEON mc_chroma to work with NV12 and re-enable it

Up to 10-15% faster overall.

CABAC/CAVLC: use the new bit-iterating macro here too

quant_4x4x4: quant one 8x8 block at a time

This reduces overhead and lets us use less branchy code for zigzag, dequant,
decimate, and so on.
Reorganize and optimize a lot of macroblock_encode using this new function.
~1-2% faster overall.

Includes NEON and x86 versions of the new function.
Using larger merged functions like this will also make wider SIMD, like
AVX2, more effective.

Add AvxSynth support to the AviSynth input module.

Uses dlopen to load AvxSynth on Linux and OS X.

Allows the use of --demuxer avs for AvxSynth, though the only source filter it
can currently use is FFMS2.

Add a local copy of avxsynth_c.h and its dependent headers in extras/ so that
users don't need to actually have AvxSynth development headers installed to
enable support for it (mirroring the AviSynth behavior).

Based on a patch by 0x09 (

Eliminate some branchiness in ME/analysis

Faster, fewer branch mispredictions.

Fix some store forwarding stalls
There are quite a few others, but most of them don't help to fix or there's no
easy way to avoid them.

x86: faster AVX satd/sa8d/sa8d_satd/hadamard_ac

Use Conroe-style movddup in AVX transforms; both Sandy Bridge and Bulldozer
do movddup in the load unit, so it's totally free this way.

On Sandy Bridge:
~6% faster sa8d_satd
~5% faster hadamard_ac
~9% faster 32-bit satd
~2% faster sa8d

x86: detect Bobcat, improve Atom optimizations, reorganize flags

The Bobcat has a 64-bit SIMD unit reminiscent of the Athlon 64; detect this
and apply the appropriate flags.

It also has an extremely slow palignr instruction; create a flag for this to
avoid massive penalties on palignr-heavy functions.

Improve Atom function selection and document exactly what the SLOW_ATOM flag

Add Atom-optimized SATD/SA8D/hadamard_ac functions: simply combine the ssse3
optimizations with the sse2 algorithm to avoid pmaddubsw, which is slow on
Atom along with other SIMD multiplies.

Drop TBM detection; it'll probably never be useful for x264.

Invert FastShuffle to SlowShuffle; it only ever applied to one CPU (Conroe).

Detect CMOV, to fail more gracefully when run on a chip with MMX2 but no CMOV.

x86: combined SA8D/SATD dsp function

Speedup is most apparent for 8-bit (~30%), but gives some improvements
for 10-bit too (~12%).
64-bit only for now.

x86: port SSE2+ SATD functions to high bit depth

Makes SATD 20-50% faster across all partition sizes but 4x4.

x86: faster high bit depth ssd

About 15% faster on average.

x86: optimize and clean up predictor checking
Branchlessly handle elimination of candidates in MMX roundclip asm.
Add a new asm function, similar to roundclip, except without the round part.
Optimize and organize the C code, and make both subme>=3 and subme<3 consistent.
Add lots of explanatory comments and try to make things a little more understandable.
~5-10% faster with subme>=3, ~15-20% faster with subme<3.

Fix two bugs in predictor checking
pmv wasn't checked properly in some cases, as well as zero vector.
Output-changing portion of the following patch.

Improve lookahead-threads auto selection
Smarter decision to improve fast-first-pass performance in 2-pass encodes.
Dramatically improves CPU utilization on multi-core systems.

Tested on a quad-core Ivy Bridge (12 threads, 1080p):
Fast first pass:
veryfast: ~7% faster
faster: ~11% faster
fast/medium: ~15% faster
slow/slower: ~42% faster
veryslow: ~55% faster
veryfast: ~9% faster
(all others remained the same)

x86: Use SSE instead of SSE2 for copying data

Reduces code size because movaps/movups is one byte shorter than movdqa/movdqu.
Also merge MMX and SSE versions of memcpy_aligned into a single macro.

64-bit cabac optimizations

~4% faster PIC

~3% faster and 16 byte shorter cabac_encode_bypass
~8% faster cabac_encode_terminal
Benchmarked on Ivy Bridge

One instruction less in cabac_encode_bypass

configure: add QNX support

Windows: Enable DEP and ASLR

x86inc: Set ELF hidden visibility for global constants

x86inc: Add cvisible macro for C functions with public prefix

This allows defining externally visible library symbols.

Signed-off-by: Diego Biurrun

x86inc: rename program_name to private_prefix
Synced from libav.
The new name is more descriptive and will allow defining a separate public
prefix for externally visible library symbols.

x264.h: improve x264_encoder_reconfig documentation

Cosmetics: stricter definition of parameterless functions

Update "Install and compile x264" in doc/regression_test.txt

Fix possible non-determinism with mbtree + open-gop + sync-lookahead

Code assumed keyframe analysis would only pull one frame off the list; this
isn't true with open-gop.

x86: don't use the red zone on win64

x86-64: fix trellis asm with interlacing

Regression in r2145.
Assembly assumed array was [2][64] when it was actually [2][63].
Tiny (~0.1%) compression improvement.

x86-32: use simple nop codes for <= sse

The "CentaurHauls family 6 model 9 stepping 8" family of CPUs (flags:
fpu vme de pse tsc msr cx8 sep mtrr pge mov pat mmx fxsr sse up rng
rng_en ace ace_en) SIGILLs on long nop codes.

Bump dates to 2013

x86inc: Drop tzcnt workaround

It is no longer needed now that we've bumped the version requirement of yasm to 1.2.0.

AVX2/FMA3 version of mbtree_propagate
First AVX2 function for testing.
Bump yasm version to 1.2.0 for AVX2 support.

x86inc: Use VEX-encoded instructions in AVX functions
Automatically use VEX-encoding in AVX/AVX2/XOP/FMA3/FMA4 functions for all instructions that exist in a VEX-encoded version.
This change makes it easier to extend existing code to use AVX2.
Also add support for AVX emulation of a few instructions that were missing before.

x86inc: activate REP_RET automatically
Now RET checks whether it immediately follows a branch, so the programmer doesn't have to keep track of that condition.
REP_RET is still needed manually when it's a branch target, but that's much rarer.
The implementation involves lots of spurious labels, but that's ok because we strip them.

x86inc: support stack mem allocation and re-alignment in PROLOGUE
Use this in 8-bit loopfilter functions so they can be used if
there is no aligned stack (e.g. x86-32 MSVC or ICC 10.x).

Update config.guess and config.sub

Fix crash if the first frame is forced to a non-keyframe
This is obviously bad user input, but x264 shouldn't crash if it happens.

Fix build on ARM with binutils >=
GAS doesn't seem to like spaces in vld1 anymore, so remove those.

Fix pthread_join emulation on win32 and BeOS
Doesn't actually affect x264, but it's more correct.

Fix typo in r2222
Slightly wrong numbers in level table.

configure: fix gpac detection with -Wp,-D_FORTIFY_SOURCE=2

Solaris: use sysconf to get processor count
Solaris responds correctly to the same value as Cygwin, so let's use that.
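
Querying the processor count through sysconf can be sketched as below. The commit does not name the value it queries; `_SC_NPROCESSORS_ONLN` is assumed here because it is the widely supported key for the number of online processors, and the fallback to 1 is an addition for the example.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: number of online processors, or 1 if unknown. */
static int online_cpu_count(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n > 0)
        return (int)n;
#endif
    return 1;
}
```

The result would typically feed the automatic choice of encoding thread count.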

lavf input: allocate AVFrame correctly
Allocate AVFrames correctly with avcodec_alloc_frame().
This caused crashes with newer libavcodecs that try to free frame extradata.

Fix crash when using libx264.dll compiled with ICL for X86_64

Fix possible issues with out-of-spec QP values
Fixes a possible regression in r2228.

Attempt to optimize PPS pic_init_qp in 2-pass mode
Small compression improvement; up to ~0.5% in extreme cases.
Helps more with small slice sizes (tiny resolutions or slice-max-size).
Note that this changes the 2-pass stats file format.

Improve slice header QP selection
Use the first macroblock of each slice instead of the last of the previous.
Lets us pick a reasonable initial QP for the first slice too.
Slightly improved compression.

Update level dpb size calculation to match newer H.264 spec
Doesn't actually change encoding behavior, but makes it more correct.
Warning messages should now be accurate at higher bit depths and non-4:2:0.
Technically, since it redefines x264_level_t, this is an API version increment.

Add support for the ffmpeg/vapoursynth high bit depth y4m extensions

x86inc: Rename 3dnow2 to 3dnowext
The name "3dnowext" is more common than "3dnow2". Doesn't affect x264.

x86inc: only define program_name if the macro is unset.
This allows overriding the value from outside the file.
This can be useful if x86inc.asm is used outside of x264.

Disable ARM NEON MRC CPU test for Apple devices
The Apple A6 CPU doesn't support performance counters, so this test caused a crash.

Fix crash with no-scenecut + mbtree

Fix reconfiguring to crf=0
Lossless mode can't currently be enabled mid-stream.

ICL's preprocessor doesn't handle it correctly.
This fix is similar to libav's fix in 0db2d9.

Fix use of deprecated av_close_input_file call

Fix pkg-config for dynamic vs static linking

Set libm in the configure script if the OS has libm
Prerequisite for another configure patch after this.
Idea copied from libpthread.

Enhance mb_info: add mb_info_update
This feature lets the callee know which decoded macroblocks have changed.

Fix mb_info_free with sliced threads
x264 would free mb_info before it was completely done using it.

Enhance nalu_process
Add the input frame opaque pointer to the arguments.
This makes it easier to use with multiple simultaneous x264 encodes.

Improve mb_info constant mb optimization
Allow fast skipping even if the pskip MV isn't zero.

Export the average effective CRF of each frame
Useful to judge the resulting quality of a frame when VBV is enabled.

Remove special-casing for OpenBSD pthread handling
Previously it was policy to use -pthread, but OpenBSD now recommends -lpthread.
It's been libpthread anyway and policy has changed to stop using -pthread.

x86inc: automatically insert vzeroupper for YMM functions
Backported from libav.

Free user supplied data when deleting a frame
This eliminates a memory leak when calling x264_encoder_close.

Revert r2204
People don't seem to like this so I'm just going to get rid of it.

Faster predictor checking with subme<3
Fix a typo that made an early-skip less effective.
Avoid a relatively unpredictable branch.
Slightly changed output due to the typo-fix.
~50 cycles faster on Core i7.

Try 8x8 transform analysis even when sub8x8 partitions are present
Turn off the sub8x8 partitions, try it, and turn them back on if it didn't help.
Small compression improvement with p4x4 on (~0.1-0.5%).
Also update related comments.

Support changing resolutions between passes with macroblock-tree
Implement a basic separable bilinear filter to rescale the quantizer offsets.
Structure inspired by swscale, but floating-point instead of fixed-point.
Not as optimized as it could be, but it's quite fast already.

Example compression penalties on a 720p video game recording:
First pass with 720p and second as 480p: ~-1.5% (vs. same res)
First pass with 480p and second as 720p: ~-3% (vs. same res)
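
A separable bilinear rescale of a float grid can be sketched as a horizontal pass followed by a vertical pass. This is a simplified illustration with hypothetical function names; the filter in x264 is described as swscale-inspired and is likely structured differently (for instance with precomputed coefficients).

```c
#include <assert.h>

/* Resample `src` (src_n samples) into `dst` (dst_n samples) with linear
 * interpolation, reading and writing with the given strides. */
static void resample_line(const float *src, int src_n, int src_stride,
                          float *dst, int dst_n, int dst_stride)
{
    for (int i = 0; i < dst_n; i++) {
        /* Map the centre of output sample i into source coordinates. */
        float pos = (i + 0.5f) * src_n / dst_n - 0.5f;
        if (pos < 0.0f) pos = 0.0f;
        if (pos > src_n - 1) pos = (float)(src_n - 1);
        int i0 = (int)pos;
        int i1 = i0 + 1 < src_n ? i0 + 1 : i0;
        float frac = pos - i0;
        dst[i * dst_stride] = src[i0 * src_stride] * (1.0f - frac)
                            + src[i1 * src_stride] * frac;
    }
}

/* Separable 2-D rescale: horizontal pass into `tmp`, then vertical pass.
 * `tmp` must hold dst_w * src_h floats. */
static void rescale_offsets(const float *src, int src_w, int src_h,
                            float *dst, int dst_w, int dst_h, float *tmp)
{
    for (int y = 0; y < src_h; y++)
        resample_line(src + y * src_w, src_w, 1, tmp + y * dst_w, dst_w, 1);
    for (int x = 0; x < dst_w; x++)
        resample_line(tmp + x, src_h, dst_w, dst + x, dst_h, dst_w);
}
```

Doing the two axes separately keeps the cost per output value at two taps per pass instead of four taps over a 2-D neighbourhood, and the same 1-D routine serves both directions.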

Print elapsed time in encoding progress indicator

Cap ratecontrol predictor parameters
Limits VBV mispredictions after long periods of relatively constant video.

x86inc: import patches from libav
Allow manual invocation of WIN64_SPILL_XMM even under INIT_MMX
SSE version of mova is movaps rather than movdqa.
YMM version of movnta.
Add mp size for named arguments.
Fix DEFINE_ARGS when used outside of a cglobal.
Define a few more cpuflags.
3-argument wrappers for a few more instructions.

Fix crash with --fps 0
Fix some integer overflows and check input parameters better.
Also fix incorrect type specifiers for demuxer info printing.

Threaded lookahead

Split each lookahead frame analysis call into multiple threads. Has a small
impact on quality, but does not seem to be consistently any worse.

This helps alleviate bottlenecks with many cores and frame threads. In many
cases, this massively increases performance on many-core systems. For example,
over 100% faster 1080p encoding with --preset veryfast on a 12-core i7 system.
Realtime 1080p30 at --preset slow should now be feasible on real systems.

For sliced-threads, this patch should be faster regardless of settings (~10%).

By default, lookahead threads are 1/6 of regular threads. This isn't exacting,
but it seems to work well for all presets on real systems. With sliced-threads,
it's the same as the number of encoding threads.
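
The default described above can be written as a small function. This is a sketch with a hypothetical name; the lower bound of one thread is an assumption, and the real selection logic may include further limits not mentioned in this message.

```c
#include <assert.h>

/* Hypothetical default: lookahead thread count from the encoding threads. */
static int default_lookahead_threads(int threads, int sliced_threads)
{
    if (sliced_threads)
        return threads;        /* same as the number of encoding threads */
    int n = threads / 6;       /* roughly one sixth of the regular threads */
    return n < 1 ? 1 : n;      /* assumption: never fewer than one */
}
```

With 12 regular threads this gives 2 lookahead threads; with 4 it gives 1.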

Add support for RGB formats in bit-depth conversion filter

Fix some bugs in mb_info code

Add mb_info API for signalling constant macroblocks
Some use-cases of x264 involve encoding video with large constant areas of the frame.
Sometimes, the caller knows which areas these are, and can tell x264.
This API lets the caller do this and adds internal tracking of modifications to macroblocks to avoid problems.
This is really only suitable without B-frames.
An example use-case would be using x264 for VNC.

Faster chroma weight cost calculation

New assembly function with SSE2, SSSE3 and XOP implementations for calculating absolute sum of differences.

Add Level 5.2 support

Eradicate all mention of Extended Profile
x264 never supported it and never will because nobody uses it.

Fix disabling of mbtree when using 2pass encoding and zones

configure: force select -mXX gcc option for i386/x86-64
Makes multilib compilation more convenient.

Update config.guess and config.sub
Adds support for a bunch of targets, including:
aarch64 (armv8)

configure: correct use of RC variable and add --extra-rcflags

ICL/MSVS: Fix shared library generation and usage
MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport.
This does not fix x264 cli being unable to use a shared library built by ICL however.

Fix intra-refresh + hrd

Fix frame input colorspace check

Fix comment in deblock.c
The code does, in fact, handle CAVLC+8x8dct correctly already.

Fix sliced-threads ratecontrol bug
Was using qp instead of qscale; could cause NANs (not to mention less accurate results).

Fix clobbering of mutex/cvs
Regression in r2183.
Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower.
Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.

Sliced-threads: do hpel and deblock after returning
Lowers encoding latency around 14% in sliced threads mode with preset superfast.
Additionally, even if there is no waiting time between frames, this improves parallelism, because hpel+deblock are done during the (singlethreaded) lookahead.
For ease of debugging, dump-yuv forces all of the threads to wait and finish instead of setting b_full_recon.

Add full-recon API option
Fully reconstruct frames even without dump-yuv.

x86inc: switch to amdnops
Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes).
Won't affect much, since we don't use ALIGN much.

BMI1 decimate functions
Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible.
This means we don't actually have to add new functions to make it work.

Minor asm changes

Add row-reencoding support to VBV for improved accuracy
Extremely accurate, possibly 100% so (I can't get it to fail even with difficult VBVs).
Does not yet support rows split on slice boundaries (occurs often with slice-max-size/mbs).
Still inaccurate with sliced threads, but better than before.

Abstract bitstream backup/restore functions
Required for row re-encoding.

Add a small per-MB cost penalty for lowres
Helps avoid VBV predictors going nuts with very low-cost MBs.
One particular case this fixes is zero-cost MBs: adaptive quantization decreases the QP a lot, but (before this patch), no cost penalty gets factored in for this, because anything times zero is zero.

Remove explicit run calculation from coeff_level_run
Not necessary with the CAVLC lookup table for zero run codes.

Export PSNR/SSIM in x264 API

x86inc: support yasm -f win64
Not necessary for x264, as -m amd64 already does the right thing, but used by external users of x86inc.

Fix incorrect zero-extension assumptions in x86_64 asm
Some x264 asm assumed that the high 32 bits of registers containing "int" values would be zero.
This is almost always the case, and it seems to work with gcc, but it is *not* guaranteed by the ABI.
As a result, it breaks with some other compilers, like Clang, that take advantage of this in optimizations.
Accordingly, fix all x86 code by using intptr_t instead of int or using movsxd where necessary.
Also add checkasm hack to detect when assembly functions incorrectly assume that 32-bit integers are zero-extended to 64-bit.
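The failure mode can be modelled outside of assembly. This sketch treats a Python integer as a 64-bit register whose upper half holds leftover garbage; the helper names and the garbage pattern are illustrative only.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def movsxd(reg64):
    """Sign-extend the low 32 bits of a 64-bit register value (like x86 movsxd)."""
    low = reg64 & MASK32
    return low - (1 << 32) if low & 0x80000000 else low

def as_signed64(reg64):
    """Interpret the whole 64-bit register as a signed integer."""
    reg64 &= MASK64
    return reg64 - (1 << 64) if reg64 >> 63 else reg64

stride = 16                               # the "int" argument the caller passed
dirty_reg = (0xDEADBEEF << 32) | stride   # upper half not cleared by the caller

wrong = as_signed64(dirty_reg)   # using the full register as an offset
right = movsxd(dirty_reg)        # extending from the low 32 bits first
```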

Fix possible alignment crash when linking from MSVC
x264_cavlc_init needs to be stack-aligned now.

Fix rare overflow in 10-bit intra_satd_x3_16x16 asm

ICL: fix out of tree building and resource file usage on Windows

Add error handling for out-of-tree build

Fix RGB colorspace input
BGR/BGRA input was correct.

Fix interlaced + extremal slice-max-size
Broke if the first macroblock in the slice exceeded the set slice-max-size.

Fix regression in r2141
Broke register preservation in x264_cpu_cpuid and x264_cpu_xgetbv.
Did not cause any problems.

TBM, AVX2, FMA3, BMI1, and BMI2 CPU detection support
TBM and BMI1 are supported by Trinity/Piledriver.
The others (and BMI1) will probably appear in Intel's upcoming Haswell.
Also update x86inc with AVX2 stuff.

x86inc: add TAIL_CALL macro to abstract a common asm idiom

Minor asm optimizations/cleanup

Clean up and optimize weightp, plus enable SSSE3 weight on SB/BDZ
Also remove unused AVX cruft.

XOP frame_init_lowres
Covers both 8-bit and 16-bit, ~5-10% faster on Bulldozer.

XOP 8x8 zigzags
Field: 35(mmx) ->16(xop) cycles
Frame: 32(ssse3)->20(xop) cycles

AVX 32-bit hpel_filter_h
Faster on Sandy Bridge.
Also add details on unsuccessful optimizations in these functions.

x86inc: add high halfword register support
Might be useful in a few cases.

Change %ifdef directives to %if directives in *.asm files
This allows combining multiple conditionals in a single statement.

Use TV range algorithm for bit-depth conversions
Such sources are more common, so better to be correct for the common case.
This also produces less error for the case of full range than the previous algorithm produced for the case of TV range.
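For illustration, here are the two families of conversion for an 8-bit to 10-bit expansion. This is a sketch of the general technique, not x264's actual filter code; the function names and the rounding choice are assumptions.

```python
def expand_tv_range(x8):
    """Limited ("TV") range: a plain left shift keeps 16..235 on 64..940."""
    return x8 << 2

def expand_full_range(x8):
    """Full ("PC") range: rescale so that 0..255 covers 0..1023."""
    return (x8 * 1023 + 127) // 255   # integer rounding to nearest

tv_black, tv_white = expand_tv_range(16), expand_tv_range(235)
pc_white = expand_full_range(255)
# Running TV-range white through the full-range formula overshoots 940.
mismatched_white = expand_full_range(235)
```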

Bump dates to 2012

Add Windows resource file
Displays version info in Windows Explorer.

Fix win32 pthread_cond_signal
Isn't used by x264 currently, so didn't cause a problem.
Fix backported from libav.

ARM: align asm functions to 4 bytes.
Some linkers apparently fail to correctly align ARM functions when mixing with Thumb code.

Fix normalization of colorspace when input is packed YUV 4:2:2

Force keyint-min 1 with Blu-ray
Fixes an issue with referencing across I-frames that's prohibited in Blu-ray for some godforsaken reason.

Fix crash in --demuxer y4m with unsupported colorspace

Fix overread/possible crash with intra refresh + VBV

Fix trellis 2 + subme >= 8
Trellis didn't return a boolean value as it was supposed to.
Regression in r2143-5.

CABAC trellis opts part 4: x86_64 asm
Another 20% faster.
18k->12k codesize.

This patch series may have a large impact on encoding speed.
For example, 24% faster at --preset slower --crf 23 with 720p parkjoy.
Overall speed increase is proportional to the cost of trellis (which is proportional to bitrate, and much more with --trellis 2).

CABAC trellis opts part 3: make some arrays non-static

CABAC trellis opts part 2: C optimizations

Hoist the branch on coef value out of the loop over node contexts.
Special cases for each possible coef value (0,1,n).
Special case for dc-only blocks.
Template the main loop for two common subsets of nodes, to avoid a bunch of branches about which nodes are live.
Use the nonupdating version of cabac_size_decision in more cases, and omit those bins from the node struct.
CABAC offsets are now compile-time constants.
Change TRELLIS_SCORE_MAX from a specific constant to anything negative, which is cheaper to test.
Remove dct_weight2_zigzag[], since trellis has to lookup zigzag[] anyway.

60% faster on x86_64.
25k->18k codesize.

CABAC trellis opts part 1: minor change in output
Due to different tie-break order.

x86inc improvements for 64-bit

Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments

High bit depth SSE2/AVX add8x8_idct8 and add16x16_idct8
From Google Code-In.

MMX/SSE2/AVX predict_8x16_p, high bit depth fdct8
From Google Code-In.

XOP 8-bit fDCT
Use integer MAC for one of the SUMSUB passes. About a dozen cycles faster for 16x16.

High bit depth intra_sad_x3_4x4
From Google Code-In.

Use a large LUT for CAVLC zero-run bit codes
Helps the most with trellis and RD, but also helps with bitstream writing.
Seems at worst neutral even in the extreme case of a CPU with small L2 cache (e.g. ARM Cortex A8).

High bit depth intra_sad_x3_8x8, intra_satd_x3_4x4/8x8c/16x16
Also add an ACCUM macro to handle accumulator-induced add-or-swap more concisely.

MMX 10-bit predict_8x8c_h and predict_8x16c_h
From Google Code-In.

Some MBAFF x86 assembly functions.
deblock_chroma_420_mbaff, plus 422/422_intra_mbaff implemented using existing functions.
From Google Code-In.

More ARM NEON assembly functions
predict_8x8_v, predict_4x4_dc_top, predict_8x8_ddl, predict_8x8_ddr, predict_8x8_vl, predict_8x8_vr, predict_8x8_hd, predict_8x8_hu.
From Google Code-In.

More 4:2:2 asm functions
High bit depth version of deblock_h_chroma_422.
Regular and high bit depth versions of deblock_h_chroma_intra_422.
High bit depth pixel_vsad.
SSE2 high bit depth and MMX 8-bit predict_8x8_vl.
Our first GCI patch this year!

SSE2 and SSSE3 versions of sub8x16_dct_dc
Also slightly faster sub8x8_dct_dc

Resize filter updates
Use AVPixFmtDescriptors to pick the most compatible x264 csp for any pixel format.
Fix deprecated use of av_set_int.
Now requires libavutil >= 51.19.0

Add out-of-tree build support

Limit SSIM to 100db
Avoids floating point error for infinite SSIM (lossless).
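A minimal sketch of such a cap, assuming the usual decibel form -10*log10(1 - SSIM). The function name and the threshold derivation are illustrative.

```python
import math

def ssim_db(ssim, cap=100.0):
    """Convert an SSIM score in [0, 1] to decibels, capped at `cap` dB."""
    inv = 1.0 - ssim
    if inv <= 10.0 ** (-cap / 10.0):   # lossless or nearly so: log10 would blow up
        return cap
    return -10.0 * math.log10(inv)
```

A lossless encode (SSIM of exactly 1.0) then reports the cap instead of infinity.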

Fix wrong conditional inclusion of inttypes.h
inttypes.h is required by encoder/ratecontrol.c for SCNxxx macros, and HAVE_STDINT_H does not imply having inttypes.h.
stdint.h is a subset of inttypes.h, but this isn't enough for x264.
This change fixes building x264 with Android's toolchain.

Fix crash with sliced threads and input height <= 112

Fix loading custom 8x8 chroma quant matrices in 4:4:4

Fix PCM cost overflow

Fix overflow in 8-bit x86 vsad asm function

Fix crash in --fullhelp when compiled against recent ffmpeg
Don't assume all pixel formats have a description.

Fix regression in r2118
Broke trellis with i16x16 macroblocks.

Modify MBAFF chroma deblock functions to handle U/V at the same time
Allows for more convenient asm implementations.

CABAC trellis optimizations: use SIMD quant
Significant speed increase, minor change in output due to rounding.

YUV range detection and support for x264CLI
Two new options: --input-range and --range.
--input-range forces the range of the input in case of misdetection; auto by default.
--range sets the range of the output; x264cli will convert if necessary, TV by default.
--fullrange is now removed as a CLI option (but the libx264 API is unchanged).

Pass through user data

Remove unpredictable branch in CABAC dqp

x86inc: AVX symmetry optimization
3-arg AVX ops with a memory arg can only have it in src2,
whereas SSE emulation of 3-arg prefers to have it in src1 (i.e. the move).
So, if the op is symmetric and the wrong one is memory, swap them.
Eliminates redundant moves in some cases when using 3-operand without AVX with memory arguments.
Also fix movss and movsd in some cases, and flag shufps correctly as float.

checkasm: shut up gcc warnings, fix some naming of functions in results

checkasm: fix build on ARM
Because of how ALIGNED_ARRAY_16 is defined on ARM, array initialisers cannot be used here. Use memset() instead.

Improve makefile rules
Remove the need for "make clean" after most reconfigures.

Mark some local functions as static, cosmetics

Fix crash if timecode file opening fails

Configure: force PIC for shared build on PARISC and MIPS

Improve yasm version check
Previous check allowed certain earlier versions that weren't fully compatible.

Add fenc prefetching to adaptive quant
Many fewer cache misses, faster adaptive quant.

Split prefetch_fenc between colorspaces
Add 4:2:2 version.

Some more 4:2:2 x86 asm
coeff_last8, coeff_level_run8, var2_8x16, predict_8x16c_dc, satd_4x16, intra_mbcmp_8x16c_x3, deblock_h_chroma_422

Remove obsolete versions of intra_mbcmp_x3
intra_mbcmp_x3 is unnecessary if x9 exists (SSSE3 and onwards).

SSSE3/SSE4/AVX 9-way fully merged i8x8 analysis (sa8d_x9)
x86_64 only for now, due to register requirements (like sa8d_x3).

i8x8 analysis cycles (per partition):
penryn    sandybridge  bulldozer
616->600  482->374     418->356   preset=faster
892->632  725->387     598->373   preset=medium
948->650  789->409     673->383   preset=slower

SSSE3/SSE4/AVX 9-way fully merged i8x8 analysis (sad_x9)
~3 times faster than current analysis, plus (like intra_sad_x9_4x4) analyzes all modes without shortcuts.

Merge i4x4 prediction with intra_mbcmp_x9_4x4
Avoids a redundant prediction after analysis.

Inline i4x4/i8x8 encode into intra analysis
Larger code size, but faster.

Initial XOP and FMA4 support on AMD Bulldozer
~10% faster Hadamard functions (SATD/SA8D/hadamard_ac) plus other improvements.

ARM: update NEON chroma deblock functions to NV12 pixel format

Add /usr/lib/{64/}values-xpg6.o to $LDFLAGS on Solaris
This is required for POSIX.1-2001 compliance.

Fix linker test for -Bsymbolic
The Solaris linker only accepts -Bsymbolic for objects compiled in dynamic mode (i.e. shared objects), so pass -shared to gcc.
Additionally, for x86_32 unresolved textrels cause a linker error so mark the .text section as 'impure'.

Add $SOFLAGS to exported SOFLAGS make variable

Allow setting a chroma format at compile time
Gives a slight speed increase and significant binary size reduction when only one chroma format is needed.

Improve profile help
List high422/high444 profiles, and don't show non-high-bit-depth profiles in high bit depth builds.

Fix infinite loop parsing TDecimate Mode 3 timecode v1 files

Fix some integer overflows/signedness errors found by IOC
The only real bug here is in slicetype.c, which may or may not affect real encodes.

Fix pixel_var2 with 4:2:2 encoding
Might have caused artifacts or suboptimal chroma compression.

Fix chroma intra analysis in 4:4:4 lossless mode

Fix use of uninitialized MVs in sub8x8 RDO

Fix detection of Alpha CPU arch on alphaev67

Optimize x86 asm for Intel macro-op fusion
That is, place all loop counter tests right before their conditional jumps.

CAVLC: clean up and restructure
Somewhat faster CAVLC and RD bit-counting.

CABAC: clean up and restructure
Somewhat faster CABAC and RD bit-counting.

Some initial 4:2:2 x86 asm

4:2:2 encoding support

SSSE3/SSE4 9-way fully merged i4x4 analysis (sad/satd_x9)

i4x4 analysis cycles (per partition):
penryn    sandybridge
184-> 75  157-> 54     preset=superfast (sad)
281->165  225->124     preset=faster (satd with early termination)
332->165  263->124     preset=medium
379->165  297->124     preset=slower (satd without early termination)

This is the first code in x264 that intentionally produces different behavior
on different cpus: satd_x9 is implemented only on ssse3+ and checks all intra
directions, whereas the old code (on fast presets) may early terminate after
checking only some of them. There is no systematic difference on slow presets,
though they still occasionally disagree about tiebreaks.

For ease of debugging, add an option "--cpu-independent" to disable satd_x9
and any analogous future code.

Faster intra_mbcmp_x3 for versions without dedicated asm
Select asm subroutines more intelligently in the wrapper functions.

Optimize x86 intra_predict_4x4 and 8x8

High bit depth Penryn, Sandybridge cycles:
4x4_ddl: 11->10, 9-> 8
4x4_ddr: 15->13, 12->11
4x4_hd: , 15->12
4x4_hu: , 14->13
4x4_vr: 15->14, 14->12
8x8_ddl: 32->19, 19->14
8x8_ddr: 42->19, 21->14
8x8_hd: , 15->13
8x8_hu: 21->17, 16->12
8x8_vr: 33->19,

8-bit Penryn, Sandybridge cycles:
4x4_ddr: 24->15,
4x4_hd: 24->16,
4x4_hu: 23->15,
4x4_vr: 23->16,
4x4_vl: 10-> 9,
8x8_ddl: 23->15,
8x8_hd: , 17->14
8x8_hu: , 15->14
8x8_vr: 20->16, 17->13

Use realistic alignment for intra pred benchmarks in checkasm

Fix frame packing SEI with --frame-packing 0
According to the spec, when frame_packing_arrangement_type is equal to 0, quincunx_sampling_flag shall be equal to 1.

Fix install/uninstall shared libs if SYS is WINDOWS/CYGWIN

Add Hurd support to configure

Optimize x86 intra_satd_x3_*
~7% faster.

Optimize x86 intra_sa8d_x3_8x8
~40% faster.
Also some other minor asm cosmetics.

Scale interlaced refs/mvs for mvr predictors
Slightly improves compression and fixes a Valgrind error.

Optimize predict_8x8_filter and incidentally remove a valgrind false-positive

Don't override flat SSE2 dequant functions with non-flat AVX ones
Slightly faster.

Shut up some valgrind false-positives

Avoid some unnecessary allocations with B-frames/CABAC off

Fix typo in p8x8 RD analysis
Passed wrong idx to trellis.

Fix invalid memory accesses in x86 lowres_init when width <= 16

Fix intermediate conversion for YUVJ* pixfmts with 4:4:4 encoding

Fix pic_out returned by x264_encoder_encode with 4:4:4

Fix zeroing of mvr predictors in bskip blocks

Fix: chroma planes for weightp analysis were not initted if U early-terminates and V doesn't.

Expand borders before chroma weightp analysis
Prevents mc from using uninitialized source pixels.

Another 4:4:4 chroma weightp bug fix

Fix typo in help

Improve support for varying resolution between passes
Should give much better quality, but still doesn't support MB-tree yet.
Also check for the same interlaced options between passes.
Various minor ratecontrol cosmetics.

asm cosmetics: base-4 constants for shuffles

Enable some existing asm functions that were missing function pointers
pixel_ads1_avx, predict_8x8_hd_avx
High bit depth mc_copy_w8_sse2, denoise_dct_avx, prefetch_fenc/ref, and several pixel*sse4.

Remove some unused, broken, and/or useless functions
Unused frame_sort.
Unused x86_64 dequant_4x4dc_mmx2, predict_8x8_vr_mmx2.
Unused and broken high_depth integral_init*h_sse4, optimize_chroma_*, dequant_flat_*, sub8x8_dct_dc_*, zigzag_sub_*.
Useless high_depth dequant_sse4, dequant_dc_sse4.

asm cosmetics: merge all the variants of ABS macros

asm cosmetics part 2
These changes were split out of the cpuflags commit because they change the output executable.

asm cosmetics: INIT_MMX/XMM/YMM now support a cpuflags argument

Reduces the number of macro args that need to be passed around.
Allows multiple implementations of a given macro (e.g. PALIGNR) to check
cpuflags at the location where the macro is defined, instead of having
to select implementations by %define at toplevel.
Remove INIT_AVX, as it's replaced by "INIT_XMM avx".

This commit does not change the stripped executable.

Import x86inc.asm patches from libav

Cosmetics: s/mmxext/mmx2/

Fix two bugs in 4:4:4 chroma weightp analysis
Caused slightly worse compression.

Fix "--asm avx"
Previously required "--asm sse2fast,fastshuffle,sse4.2,avx".

Re-add support for glibc <2.6, which doesn't have CPU_COUNT

Avoid using deprecated libavformat functions
Replace av_find_stream_info with avformat_find_stream_info.
Now requires libavformat 53.3.0 or newer.

Use assembly versions of some deblocking functions in MBAFF

Move X264_VERSION / X264_POINTVER from config.h to x264_config.h
This makes them available to external programs as part of the public API.

Fix padding bug in x264_expand_border_mbpair

Timecode parsing: Add missing initialization
Fix crash when failed to parse timecode file before malloc pts.
Fix detection of user timebase considered to be exceeding H.264 maximum.

Fix crash with high bitdepth 4:2:0 input

x86 asm cosmetics
Use FDEC_STRIDEB where appropriate.

Fix a bug in lossless sub-8x8 RD
Caused crashes in rare cases with lossless encoding. Regression in 4:4:4.

Improved p8x4/4x8 search decision
Use the same thresholding as for p16x8/8x16.
Does p8x4/4x8 search more often, for a small compression improvement.

Add --subme 11, which disables all early terminations in analysis
Necessary for a future trellis mode decision/motion estimation patch.
Also add the slowest presets to the regression test.

Some trivial changes to RD thresholds
The output-changing portion of the next patch.

Allow setting a wider range of chroma QP offsets
This allows use of the full range of chroma QP offsets, even in combination with the automatic psy-based adjustments.

Optimize macroblock_deblock_strength, add more early terminations

Function-pointerify MBAFF deblocking functions

Clean up MBAFF deblocking code

Optimize frame_deblock_row

Shrink two arrays

Add support for the new (4:4:4) colorspaces to x264_picture_alloc

Various cosmetics

Improve configure help

Use $optarg for some configure options

Linux x264_cpu_num_processors(): use glibc macros
The cpu_set_t structure is considered opaque.
Also handle sched_getaffinity() error case if "cpusetsize is smaller than the size of the affinity mask used by the kernel."
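The same idea, counting the CPUs in the affinity mask without looking inside it and tolerating a failed query, looks like this in Python. It is an analogue for illustration, not the C code from the commit; `usable_cpu_count` is a made-up name.

```python
import os

def usable_cpu_count():
    """Number of CPUs this process may run on, falling back to the total count."""
    try:
        # Linux-only call; the returned set is treated as opaque and just counted.
        return len(os.sched_getaffinity(0))
    except (AttributeError, OSError):
        # Not available on this platform, or the query failed.
        return os.cpu_count() or 1
```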

Fix spurious "stream properties changed" with --seek option on some inputs

Fix use of deprecated libavcodec functions
Replace avcodec_open with avcodec_open2. Now requires libavcodec 53.6.0 or newer.

Fix nalu_process callback with HRD

Fix incorrect chroma swap for some input pixfmts

Problem occurred if pixfmt of lavf/ffms input was PIX_FMT_RGB24 or PIX_FMT_YUV444P.

Fix resize filter crash with YUVJ* input pixfmt

RGB encoding support
Much less efficient than YUV444, but easy to support using the YUV444 framework.

4:4:4 encoding support

Properly weight slice header lambda in chroma weightp analysis

Better x86 high bit depth predict_8x8c_p
Avoid the need to check for corner cases by reordering arithmetic.
Also make a minor optimization to high bit depth predict_16x16_p.

Eliminate extra layer of indirection for sps/pps references
Also remove poc type 1 support (it didn't work anyways) to reduce sps size.

Fix SSIM calculation with sliced threads

Avoid possible NaNs in B-frame output stats

ARM: do not override the toolchain default for FPU ABI

Fix link errors with libswscale/libavutil as shared libraries

Fix deprecation in libavformat usage
Replace av_open_input_file with avformat_open_input. Now requires libavformat 53.2.0 or newer.

Fix various issues with VBV+threads
Eliminate the race condition with interframe row predictors and threads.
Recalculate frame_size_estimated at the end of a frame, for improved update_vbv_plan.
Some cosmetics.

Fix MBAFF row VBV ratecontrol
Reverts most of r1984 and implements a much simpler solution.

Make ratecontrol_mb less slow

Resize filter updates
Fix use of deprecated sws_getContext.
Fix uses of sws_format_name.
Fix stream change warning not occurring on the first resolution change.
Drop cpu detection, as it is now performed internally by swscale.
Update swscale version requirements.

AVX mbtree_propagate
Up to ~20-30% faster than SSE2 on Sandy Bridge.

Use -vsync 0 with ffmpeg regression test

Inline emms instructions on x86 if possible

Make left_index_table const
Should allow for some missed compiler optimizations in macroblock_cache_load.

Make --profile main/baseline force off CQMfile

Fix VBV bug caused by zero i_row_satd value for first and last row

Fix crash with VBV + forced QP

Fix VBV bug with MinCR limit

Fix bitstream reallocation with slice-max-size + MBAFF

Improve build system capabilities
Make static lib and CLI optional.
Support linking CLI to system libx264.
Don't strip by default, to match GNU packaging guidelines.

Slightly speed up x86 CABAC asm
Also make some various cleanups.

Faster pixel_memset
~4x faster.
Also inline plane_expand_border for improved constant propagation.

Add checkasm tests for memcpy_aligned, memzero_aligned
Also make memcpy_aligned support sizes smaller than 64.

MBAFF: Add regularization to VSAD metric
Bias towards the MBAFF decisions made in neighboring mb pairs.
~2% better compression on a random 1080i HDTV source.

MBAFF: Improve handling of bottom row mod32 padding
Force skip on any MBs entirely outside the frame.
If an mb pair in the bottom row is chosen to be progressive, re-pad the bottom rows progressively.

MBAFF: Add frame/field MB stats

MBAFF: Template direct spatial

MBAFF: Template cache_load and cache_load_neighbours

MBAFF: Make interlaced support a compile time option

MBAFF: Don't call zigzag_init for every mb

MBAFF: Modify ratecontrol to update every two rows

MBAFF: Add support for slice-max-size

Also add slice-max-size to the regression tests.

MBAFF: Add support for slice-max-mbs

MBAFF: Adaptive quantization

Compute energy for interlaced and progressive choices and pick the least.

MBAFF: Enable adaptive MBAFF with VSAD decision

MBAFF: Create a VSAD DSP function

x86 assembly by Jason Garrett-Glaser. This gives roughly 30x speed
increase over the C version.
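As a rough sketch of what a vertical SAD measures and how it could drive a frame/field choice: the code below sums absolute differences between vertically adjacent rows, then compares the woven frame against its two fields. The decision rule and all names are assumptions for illustration, not the encoder's actual logic.

```python
def vsad(rows):
    """Sum of absolute differences between vertically adjacent rows."""
    return sum(abs(a - b)
               for upper, lower in zip(rows, rows[1:])
               for a, b in zip(upper, lower))

def prefer_interlaced(block):
    """Pick field coding when the two fields are smoother than the woven frame."""
    frame_score = vsad(block)
    field_score = vsad(block[0::2]) + vsad(block[1::2])
    return field_score < frame_score

# Combed content: even and odd lines come from different moments in time.
combed = [[10] * 4, [200] * 4] * 4
# Static content: every line is the same.
flat = [[50] * 4] * 8
```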

MBAFF: Direct spatial

MBAFF: Direct temporal

MBAFF: Calculate bipred POCs

Need to calculate two tables for the cases where the current macroblock is
progressive or interlaced as refs are calculated differently for each.

MBAFF: Use both left macroblocks for ref_idx calculation

MBAFF: First edge deblocking

MBAFF: Implement left edge deblocking functions

MBAFF: Add extra data to the deblock strength structure

MBAFF: Deblocking support

MBAFF: Move common code from deblock functions

MBAFF: Add mbaff deblock strength calculation

Move call to deblock_strength to x264_macroblock_deblock_strength to
keep deblock strength calculation in one place.

MBAFF: Update x264_cabac_mvd_sum_mmxext to work with larger MVDs.

Author: Loren Merritt <>

MBAFF: Clamp MVDs to 66 instead of 33

MBAFF: CABAC encoding of skips

MBAFF: Track what interlace decision the decoder is using

MBAFF: Fix mvy bounds

Fix MV clipping

MBAFF: Copy deblocked pixels to other plane

MBAFF: Disallow skip where predicted interlace flag would be wrong

MBAFF: Inter support

MBAFF: Neighbour calculation

Back up intra borders correctly and make neighbour calculation several times longer.

MBAFF: Store references to the two left macroblocks

MBAFF: Store left references in a table

MBAFF: Disable adaptive MBAFF when subme 0 is used

MBAFF: Save interlace decision for all macroblocks

Fix bug in NAL buffer resizing
Also properly terminate if NAL buffer resizing fails.

Fix zone bitrate multiplier and QP forcing in 2-pass mode
Previously zone changes could affect frames outside of the given frame range (around 20 neighboring frames).

Use float constants in qp rounding
Slight performance improvement and fixes slight difference in output between gcc 3.4 and 4.5.

Fix bugs with ratecontrol reconfiguration
Initialization of some parameters was missed or wasn't synchronized with other threads.

More validation of input parameters
This fixes a crash with --me umh and insane values of --me-range.

Fix bug in --b-adapt 2 with --rc-lookahead >248
Problem caused by buffer overflow in strcpy.

Check for invalid pixfmts in lavf demuxer

Fix regression in r1944
Broke sliced-threads + slice-max-size/slice-max-mbs.

Precalculate CABAC initialization contexts
Slightly faster encoding with lots of slices.

Avoid redundant log2f calls in mv cost initialization
Saves around 100 million clock cycles on x264 init.

CABAC residual: cleanup and optimizations
Also kill all Hungarian notation while we're at it.
Trim an instruction off cabac_encode_bypass.

Validate input parameters more carefully
Get rid of redundant warnings upon encoder_reconfig calls.
Also avoid encoder_reconfig turning off psy_rd/trellis.

Fix VFR MB-tree to work as intended
Should improve quality with FPSs much larger or smaller than 25.

Support more recent GPAC versions

Fix decoder desync with positive --chroma-qp-offset and zones


Fixes build with lavf/lavc 53.

Force pic-struct for Blu-ray compat + fake-interlaced

Fix open-gop with no-psy

Fix build with disabled asm

Improve Blu-ray compliance
Use dec_ref_pic_marking SEIs to repeat B-ref referencing information.
Don't allow B-frames to reference frames outside their minigop.

Consolidate Blu-ray hacks into --bluray-compat
This option is now required for Blu-ray compatibility.
--open-gop bluray is now gone (using bluray-compat and open-gop implies a Blu-ray compatible open-gop).
This option doesn't automatically enforce every aspect of Blu-ray compatibility (e.g. resolution, framerate, level, etc).

Add SSE support to rectangle.h for 16-byte stores
Uses GCC vector intrinsics; may be suboptimal on particularly old GCC versions.

Do not force Intel Compiler to target pre-mmx architecture for x86
Caused a speed penalty against gcc equivalents.

Warn users when using --(psnr|ssim) without --tune (psnr|ssim)
This is a counter to the proliferation of incredibly stupid psnr/ssim "benchmarks" of x264 in which the benchmarker conveniently "forgot" --tune psnr/ssim, crippling x264 in the test.

Remove redundant mbcmp calls in weightp analysis

Use integer math for filler size calculation

Disable progress for FFMS input with --no-progress

Fix bug in intra-refresh ratecontrol
Row SATDs were slightly incorrect.

Cosmetics: fix some signedness issues found by -Wsign-compare

Minor fixes
Fix a comment typo.
Align an array properly.
Make x264_scan8 unsigned: saves a bunch of movsxd instructions on x86_64.

Improve C99 support checks in configure
Fixes configuration with Intel compiler in some cases.

Eliminate the possibility of CAVLC level code overflow
Instead, if it happens, just re-encode the MB at higher QPs until it fits.
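The retry strategy can be sketched as a simple loop. `try_encode` is a hypothetical callback that returns True when the macroblock fits, and the toy lambda below stands in for a real encode attempt.

```python
QP_MAX = 51  # highest QP for 8-bit H.264

def encode_until_fits(try_encode, qp):
    """Call `try_encode(qp)` at rising QPs until it reports that the MB fit."""
    for candidate in range(qp, QP_MAX + 1):
        if try_encode(candidate):
            return candidate
    raise ValueError("macroblock did not fit at any QP")

# Toy stand-in: pretend the level codes overflow below QP 30.
chosen_qp = encode_until_fits(lambda q: q >= 30, 26)
```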

x86 SIMD versions of optimize_chroma_dc
SSE2/SSSE3/SSE4/AVX implementations.
About 3x faster.

Add Altivec version of mc_weight

Add Altivec versions of mbcmp_x functions
These aren't merged versions, they just call the existing asm code.
A merged implementation would of course be faster.

Recognize cygwin as itself when not targeting mingw
Also fix broken thread detection on cygwin.

Patch Intel's CPU dispatcher
Reduces Intel Compiler's bias against non-Intel CPUs.

Big thanks to Agner for the original information on how to do this.

Intel Compiler support

Big thanks to David Rudie, the original author of this patch.

Cosmetics: make struct definition braces consistent

Fix restoring of console title on Windows with ffms indexing

Fix possible buffer overflow in mp4 muxer

Remove inline asm syntax not supported by LLVM's assembler
Doesn't affect compiled output outside of LLVM.

Fix 10L in r1912
SSSE3 code got used in MMX/SSE2 and vice versa (in hpel).

Add AVX functions where 3+ arg commands are useful

Frame-packing 3D: don't place scenecuts on right views
Caused problems for some players.

Improve slice-max-size handling of escape bytes
More accurate but a bit slower. Helps deal with a few obnoxious corner cases where the current algorithm failed.
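For context, escape bytes are the H.264 emulation prevention bytes: a 0x03 is inserted whenever two zero bytes are followed by a byte in the range 0x00 to 0x03, so the written NAL can be larger than the raw payload. A minimal sketch of that escaping (the function name is illustrative):

```python
def escape_nal_payload(payload: bytes) -> bytes:
    """Insert H.264 emulation prevention bytes (0x03) into a NAL payload."""
    out = bytearray()
    zeros = 0                  # consecutive zero bytes written so far
    for byte in payload:
        if zeros >= 2 and byte <= 3:
            out.append(3)      # break up a 00 00 0x sequence
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0 else 0
    return bytes(out)
```

Six zero bytes grow to eight after escaping, which is the kind of growth a slice size limit has to account for.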

Use bs_write1 wherever possible in header writing

Remove obsolete mvcost init code

Fix memory leak on encoder close if not all frames are flushed

Fix signedness bug in CPU detection
Luckily didn't affect anything due to C signedness rules.

Fix dumb bug caused by stray semicolon
Caused noise reduction to run incorrectly in part of RD, but probably had no effect.

Fix malloc of zero size
Caused x264 to fail with some settings on systems that return a NULL pointer for malloc(0), like Solaris.

Fix crash in mp4 muxer after failure of x264_encoder_open

Fix shadowed variable warning in ffms.c

Fix some Intel compiler warnings

Fix 10L in r1886
Aspect ratio can't be set before SPS is initted.

Improve update interval of x264cli progress information
Now updates every 0.25s instead of every N frames.
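A time-based throttle of this kind can be sketched as follows. The class name and the injectable clock are assumptions made so the example can be tested; they do not reflect the x264cli source.

```python
import time

class ProgressThrottle:
    """Allow a status update only when `interval` seconds have elapsed."""

    def __init__(self, interval=0.25, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last = None       # time of the last permitted update

    def ready(self):
        now = self.clock()
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

Calling `ready()` once per encoded frame and printing only when it returns True gives a steady update rate regardless of encoding speed.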

Windows: restore previous console title after encoding
MSDN docs claim that SetConsoleTitle's effect is reverted when the process terminates, but this doesn't always work properly.
Accordingly, manually revert the console title at the end of encoding.

Allow WEIGHTP_FAKE in interlaced mode
It seems to work fine as-is even though real weightp doesn't support interlacing yet.

Output pic struct information in libx264 API

Enable FastShuffle on Penryn and Nehalem CPUs without SSE4

Hotfix for some bugs in VBV emergency

Fix warnings in cpu.c

Check for OS AVX support in addition to CPUID
Even if not using ymm registers, AVX operations will cause SIGILLs on unsupported OSs.
On Windows, AVX is only available on Windows 7 SP1 or later.

VBV emergency mode
Allow ratecontrol to select "quantizers" above the maximum.
These "quantizers" progressively decimate the source to avoid VBV underflow.
x264 is now VBV compliant even with input as evil as /dev/random.

Initial AVX support
Automatically handle 3-operand instructions and abstraction between SSE and AVX.
Implement one function with this (denoise_dct) as an initial test.
x264 can't make much use of the 256-bit support of AVX (as it's float-only), but 3-operand could give some small benefits.

Double the base framerate for frame-sequential 3D files
A 60fps frame-sequential 3D file is really only 30 FPS, just alternating between eyes.
Accordingly, ratecontrol should treat it as if it was really 30 FPS.
This will increase the bitrate at the same CRF level for such videos when --frame-packing 5 is used.

Add --input-fmt option to lavf input
Conforms to ffmpeg's `-f` option.
Use this when lavf fails to guess the input format.

Two improvements to regression test script
Use SHA-1 hashes for temporary file names to avoid exceeding OS filename length limits.
Correctly return to the original branch after testing if you were on a branch.
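The SHA-1 naming idea, sketched in Python: hashing an arbitrarily long command line yields a fixed 40-character hex name. The helper name and the suffix are hypothetical and are not taken from the test script.

```python
import hashlib

def temp_name_for(command_line, suffix=".264"):
    """Derive a fixed-length temporary file name from an arbitrary command line."""
    digest = hashlib.sha1(command_line.encode("utf-8")).hexdigest()
    return digest + suffix

# Even a very long option string maps to a short, stable name.
long_name = temp_name_for("x264 " + "--some-option value " * 500)
```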

Add some missing values to the non-extended SAR table

Bump dates to 2011

More correctly write frame-packing SEI flags

Bug reported by Nero.

Don't die in x264_encoder_close if an error occurred in x264_encoder_encode
Also clean up properly in x264.c (mostly useful for finding bugs in cleanup).

Fix reconfiguration of b_tff
Attempting to change field order during encoding could cause slight corruption.

Also fix delta_poc_bottom to be correctly set if interlaced mode is used without B-frames.

Fix x264 CPU detection with >=64 CPUs on Windows
x264 won't actually use more than one processor group's worth of CPUs, however.
This isn't a problem, as a single x264 instance can't effectively use a full 64 cores anyways.

Remove high bit depth mmx quant
It was using pmuludq which is sse2, and the function isn't really possible without pmuludq.

Fix cacheline check in avg2 w20 cache32
Didn't result in incorrect output, only slightly decreased speed on a few obsolete systems.

instruction in high bit depth ssd_nv12_mmxext

VFR/framerate-aware ratecontrol, part 2
MB-tree and qcomp complexity estimation now consider the duration of a frame in their calculations.
This is very important for visual optimizations, as frames that last longer are inherently more important quality-wise.
Improves VFR-aware PSNR as much as 1-2db on extreme test cases, ~0.5db on more ordinary VFR clips (e.g. deduped anime episodes).

WARNING: This change redefines x264's internal quality measurement.
x264 will now scale its quality based on the framerate of the video due to the aforementioned frame duration logic.
That is, --crf X will give lower quality per frame for a 60fps video than for a 30fps one.
This will make --crf closer to constant perceptual quality than previously.
The "center" for this change is 25fps: that is, videos lower than 25fps will go up in quality at the same CRF and videos above will go down.
This choice is completely arbitrary.

Note that to take full advantage of this, x264 must encode your video at the correct framerate, with the correct timestamps.
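To make the direction of the change concrete, here is a sketch of a duration-based scaling factor centered on 25fps. The power-law form is an assumption for illustration (0.6 is x264's default qcomp); the real ratecontrol code may differ.

```python
BASE_FRAME_DURATION = 1.0 / 25.0   # the 25fps "center" named in the note

def duration_factor(frame_duration, qcomp=0.6):
    """Relative quantizer scale for a frame of the given duration (seconds).

    Values above 1 mean a coarser quantizer (lower quality) than at 25fps.
    The power-law form is an assumption for illustration.
    """
    return (BASE_FRAME_DURATION / frame_duration) ** (1.0 - qcomp)

at_60fps = duration_factor(1.0 / 60.0)   # short frames: quality goes down
at_12fps = duration_factor(1.0 / 12.0)   # long frames: quality goes up
```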

Improve reference ordering in interleaved 3D video
Provides a decent compression improvement when encoding interleaved 3D content (--frame-packing 5).
Helps more without B-frames and at lower bitrates.
Note that x264 will not do this optimization unless --frame-packing 5 is used to tell x264 that the source is interleaved 3D.

Tests consistently show that interleaved frame packing is by far the best way to compress 3D content.
It gives a ~35-50% compression benefit over separate streams or top/bottom or left/right coding.

Also finally add support for L1 reference reordering (in B-frames).
Also add support for reordered ref0 in L0 and L1 lists; could be useful in the future for other things.

Cosmetics: fref0/1 -> fref[2] and i_ref0/1 -> i_ref[2]
A much-needed refactoring, plus makes the next patch easier.

Check an extra offset during weightp analysis
Up to 0.1 - 0.6 dB gain on some fade-ins with --weightp 1, less with --weightp 2.

SSE2 high bit depth SSIM functions

Patch from Google Code-In.

SSE2 high bit depth intra_predict_(8x8c|16x16)_p

Patch from Google Code-In.

MMX high bit depth coeff_last4

Patch from Google Code-In.

SSE2 high bit depth zigzag_interleave_cavlc

Patch from Google Code-In.

MMX/SSE2/SSSE3 high bit depth frame_init_lowres functions

Patch from Google Code-In.

MMX high bit depth 4x4 intra predict functions
DDR and HD directions, as well as making HU faster.
Also enable some SSE2 versions of high bit depth functions that were added but not properly enabled.

Patch from Google Code-In.

SSE2 high bit depth 8x8 intra predict functions
DDL, DDR, VR, HU, and HD directions, as well as the 8x8 filter.
Also make 8-bit MMX VR faster, by backporting the optimizations from the high bit depth version.

Patch from Google Code-In.

MMX/SSE2 high bit depth 8x8c intra predict functions

Patch from Google Code-In.

MMX version of high bit depth plane_copy
And various cosmetics.

Patch from Google Code-In.

Faster x86 predict_8x8c_dc, MMX/SSE2 high bit depth versions

SSSE3 high bit depth sad_aligned functions

MMX/SSE2 high bit depth interleave functions

Patch from Google Code-In.

MMX/SSE2 high bit depth avg functions

Patch from Google Code-In.

MMX/SSE2 high bit depth deinterleave functions

Patch from Google Code-In.

Shut up some incorrect gcc uninitialized variable warnings

Write --crop-rect and --frame-packing options to x264 SEI

Add missing space to parameter SEI

Fix typo in documentation

Fix redundant linebreaks in statsfile with weightp

Use cross_prefix for strings in endian test and as test

Fix checkasm test for quant in high bit depth
Eliminate some spurious failures.

Fix broken YV12 handling in the resize filter

Fix bug with negative lookahead mb costs in high bit depth

Fix overflow in SSIM calculation in 10-bit

Fix some possible overflows in VFR ratecontrol with extreme timebases

Fix memory leak in lavf demuxer.
Leak only occurred with input files that have more than one video stream.

Fix satd predictors with high bit depth
Resulted in odd CRF-mode results with --no-mbtree, as well as suboptimal VBV handling.

Fix compile error with high bit depth and disable-asm

Really fix gcc win32 misalignment crash
gcc's -fno-zero-initialized-in-bss only works if an explicit initializer (e.g. = {0}) is used.

Support for native Windows threads

Patch originally by Pegasys Inc.

MMX/SSE2 high bit depth weight_cache/offset(sub|add) functions

Patch from Google Code-In.

SSE2 high bit depth dequant functions

Patch from Google Code-In.

SSE2 high bit depth zigzag functions

Patch from Google Code-In.

MMX/SSE2 versions of high bit depth store_interleave

Patch from Google Code-In.

Add frame-packing SEI support for signalling 3D video

Allow 8x8dct+cavlc+lossless with subme>=6

Add interlaced/no-interlaced case to regression test script

Save more memory with weightp in >8-bit

.gitignore more untracked file types

Work around gcc/ld alignment bug on win32
Fixes problems due to misalignment of static zero arrays (win32 ld can't align .bss properly).

Fix high bit depth intra pred functions
And re-enable them accordingly.

Patch from Google Code-In.

Fix weightp analysis with high bit depth

Fix build error in high depth
Caused by multiple definitions of x264_add8x8_idct_sse2.

Hotfix for high bit depth
Temporary fix for some unaligned access crashes.

Delete x264_config.h on distclean

Tons of high bit depth intra predict asm

Patch from Google Code-In.

SSE2 high bit depth 8x8/16x16 idct/idct_dc

Patch from Google Code-In.

Create and install x264_config.h
This header can be used to determine the bit-depth and license of libx264.

Detect Avisynth initialization failures
Detect if there is a critical Avisynth initialization failure and print the associated error.
This, however, requires a feature present in the latest version of Avisynth alpha (2.6).
Previous versions are unaffected.

Automatically restrict QPs to avoid quantization (under|over)flow
--cqm jvt and similar should now work "out of the box" instead of requiring futzing with --qpmin.

Don't try to get timecodes if reading frame failed
This fixes "input timecode file missing data for frame" warning with piped input where we don't know total number of frames.

Fix possible overflow in sub4x4_dct in 10-bit builds

Fix bug in intra-refresh + threads
Intra refresh bar quality increase wasn't correctly applied.

Fix file handle leak in libx264 on error

Fix incompatible csp format issue
Problem occurred with unknown pixel formats and non mod2 resolutions in the resize filter.

Really fix fittobox resize rounding code

Fix regression in rev1549
Skip auto timebase denominator generation when generated timebase denominator exceeds UINT32_MAX.
Also fix double free.

Fix --tcfile-in if timecode v2 file starts from nonzero pts

SPARC/Solaris build fixes

Fix typo in r1797

Add Python regression test script

Patch from Google Code-In.

Make --weightp 1 a better speed tradeoff
Since fade analysis is now so fast, weightp 1 now does fade analysis but no reference duplication.
This is the opposite of what it used to do (reference duplication but no fade analysis).
This also gives weightp's better fade quality to faster presets (up to superfast).

SSE versions of some high-bit-depth DCT functions
Our first Google Code-In patch!

Clean up weightp analysis function

Add API function to return max number of delayed frames

Copy field order flag in encoder_reconfig

Cosmetics in configure

Add some more info to `x264 --version`

Change qpmin default to 0
There's probably no real reason to keep it at 10 anymore, and lowering it allows AQ to pick lower quantizers in really flat areas.
Might help on gradients at high quality levels.
The previous value of 10 was arbitrary anyways.

Fix ticks_per_frame check for VFR input

Fix configure so that boolean configuration options are 1/0

There are many cases of 1/undef, not 1/0.

Only build SPARC VIS asm if high bit-depth is disabled

Fix build on SPARC Solaris 10

Fix resize filter rounding code

Fix regression in chroma weightp
Missing cache calls could cause artifacts, encoder/decoder desync.

Fix some crashes with high bit depth
Not all arrays were sufficiently aligned.

Chroma weighted prediction
Like luma weighted prediction, dramatically improves compression in fades.
Up to 4-8db chroma PSNR gain in extreme cases (short, perfect fade-outs).
On actual videos, helps up to ~1% overall.
One example video with a decent number of fades (ef OP): 0.8% bitrate reduction overall, 7% bitrate reduction just counting chroma.
Fixes a lot of artifacts in fades at lower bitrates.

Original patch by Dylan Yudaken <>.

Support custom cropping rectangles
Supposedly useful for 3D television applications.

Less verbose.

x86 asm for high-bit-depth pixel metrics
Overall speed change from these 6 asm patches: ~4.4x.
But there's still tons more asm to do -- patches welcome!

Breakdown from this patch:
~13x faster SAD than C.
~11.5x faster SATD than C (only MMX done).
~18.5x faster SA8D than C.
~19.2x faster hadamard_ac than C.
~8.3x faster SSD than C.
~12.4x faster VAR than C.
~3-4.2x faster intra SAD than C.
~7.9x faster intra SATD than C.

x86 asm for some high-bit-depth coefficient functions
~7.9x faster denoise than C.
~2.3x faster coeff_level_run than C.
~6.6x faster coeff_last than C.
~4.3x faster decimate_score than C.

Also improve checkasm's decimate_score test.

x86 asm for high-bit-depth motion compensation
~8x faster qpel MC than C.
~10x faster hpel than C.

x86 asm for high-bit-depth quant
~3.1-4.2x faster than C.

x86 asm for high-bit-depth DCT
Only MMX and DCT done so far; iDCT still needs asm as well.
~4.4x faster than C.

x86 asm for high-bit-depth deblocking
~3.3x faster than C.

Use a 16-bit buffer in hpel_filter regardless of bit depth
This only works up to and including 10-bit (but we don't support anything higher yet).

Use enums instead of magic numbers in x264_mb_partition_pixel_table

Improve configure script logging
Now prints the test program that failed in addition to error messages.

Fix constrained intra pred mode selection

Various high-bit-depth ratecontrol fixes

Fix a crash in --dump-yuv for odd resolutions

Improve flash detection algorithm change in r1765
Now disables scenecuts only near the real end of the video, not just prior to forced keyframes.

Update ffms2 support for its latest API break.

Modify the x264 header accordingly if --disable-gpl is used

Save a bit of memory with weightp + high bit depth

Fix bugs in qpfile parsing with omitted QPs

Fix HRD with intra-refresh
x264 was incorrectly calculating cpb_removal_delay with respect to the first keyframe.
It should have been calculating cpb_removal_delay with respect to the last keyframe.

Fix bug in r1753
Overflow compensation fix broke CRF with --no-mbtree.

Improve flash detection's behavior near the end of the video
Flash detection catches situations like AAAABBCCDDDD, where A,B,C,D are frames in different scenes.
x264 would place a keyframe on the first "D".
However, if the video ended on the last "C", x264 would place a keyframe on the first "C", even though C classifies as a flash.
This change fixes this issue.

Improve quantizer handling
The default value for i_qpplus1 in x264_picture_t is now X264_QP_AUTO. This is currently 0, but may change in the future.
qpfiles no longer use -1 to indicate "auto"; QP is just omitted. The old method should still work though.

CRF values now make sense in high bit depth mode.
--qp should be used for lossless mode, not --crf.
--crf 0 will still work as expected in 8-bit mode, but won't be lossless with higher bit depths.
Add bit depth to statsfiles.

These changes are required to make the QP interface sensible in combination with high bit depth.

VFR-aware PSNR/SSIM measurement
First step to VFR-aware MB-tree and bit allocation.

Disable weightp offset=-1 dupes with high bit depth
They're a hack to compensate for crappy rounding, and thus not worth doing at high bit depth, which fixes most of the rounding issues.

Make the ffmpeg -vpre error message more descriptive

Add numeric names for the presets (0==ultrafast ... 9==placebo)
This mapping will of course change if new presets are added in between, but will always be ordered from fastest to slowest.
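
As a rough illustration of the numeric aliases, the sketch below maps an index to a preset name. The ordering follows x264's preset list from fastest to slowest; the helper name preset_from_index is hypothetical and is not part of the x264 API.

```c
#include <stddef.h>

/* Preset names ordered fastest to slowest; the array index is the numeric alias. */
static const char *const preset_names[] = {
    "ultrafast", "superfast", "veryfast", "faster", "fast",
    "medium", "slow", "slower", "veryslow", "placebo"
};

/* Hypothetical helper: returns the preset name for a numeric alias,
 * or NULL if the index is out of range. */
const char *preset_from_index(int idx)
{
    size_t count = sizeof(preset_names) / sizeof(preset_names[0]);
    if (idx < 0 || (size_t)idx >= count)
        return NULL;
    return preset_names[idx];
}
```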

Update benchmarks in doc/threads.txt

Make the #if'd out naive ESA actually match the real implementation

Move mv/ref prefetch code to the correct location
Prefetching of top blocks should be done under if(top), not if(left).

Link x264cli explicitly against lavf
Fixes some problems with crappy linkers.

Fix CBR ratecontrol bug with extremely high qscales
Caused CBR ratecontrol to take a very long time to recover from extreme situations (e.g. /dev/urandom).

Disable overflow compensation in CRF mode
Wasn't designed with CRF in mind, and acts really weird with CRF+VBV.

Fix stupid bug in B-frame VBV size prediction

Fix regression in checkasm in r1666
Buffer is uint16_t* regardless of whether x264 was compiled with high bit depth or not.

Fix overflows in satd, sa8d and hadamard_ac with high bit depth

Fix potential problem with overflows in ssd_nv12
The risk of overflows increases exponentially with the bit depth.
The 8-bit asm versions may still overflow with image widths >= 11008 (or 6604 if interlaced).

Fix syntax for some parameterless functions
Technically, such functions should be declared with (void), not ().
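
A minimal sketch of the difference, using two hypothetical function names. In C before C23, empty parentheses leave the parameter list unspecified, while (void) states that the function takes no arguments.

```c
/* Empty parentheses (pre-C23): parameters are unspecified, so the compiler
 * cannot reject a call that passes arguments by mistake. */
int old_style();

/* (void): the function takes no arguments, so a call with arguments
 * is diagnosed at compile time. */
int new_style(void);

int old_style() { return 1; }
int new_style(void) { return 2; }
```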

Fix fps reporting on mingw64
_ftime on mingw64 uses __timeb32 which is broken.
Use ftime instead.

Fix compilation on PPC with some recent GCCs

Fix Altivec SATD with small strides
Fixes chroma ME and some of lookahead on PPC.

Address remaining cacheline split issues in avg2
Slightly improved performance on core 2.
Also fix profiling misattribution of w8/16/20 mmxext cacheline loops.

Trim a few bytes off some x86 intra pred functions

Move DTS compression from libx264 to x264cli
DTS compression is an ugly stupid hack and starting to encroach on unrelated areas like VBV.
Some people want it in the mp4 muxer for devices and/or splitters that don't support Edit Boxes.
We just say "throw these broken devices out the window".
DTS compression will remain as a muxer option, --dts-compress, at the user's own risk.
This option is disabled by default.

Use a larger pic_init_qp with high bit depth
Modify pic_init_qs for consistency.

Update some of the information in doc/

Update header in depth.c

Remove some old unused stuff in the build tree
Regression test (hasn't been updated since svn).
Doxy (was never used).

Various cosmetics
Exorcise some CamelCase.

Add missing mod4 stack check to sse2_misalign mc_chroma
Required for ICC compilation.

Fix 2pass ratecontrol with --nal-hrd cbr

Fix minor bug in intra pred with intra refresh
i8x8 blocks didn't properly avoid predicting from top-right when necessary.
This could cause intra refresh to not completely refresh the frame.

Fix filter parsing with --extra-cflags="-DNDEBUG"

Make sigint handler variable volatile
Didn't actually cause any problems, but is necessary because it can be modified by another thread (the signal call).
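
The general pattern looks roughly like the sketch below. It is not the code from x264.c: the names are hypothetical, and it uses the standard sig_atomic_t type as an assumption about what a portable version would look like.

```c
#include <signal.h>

/* volatile: the handler can change this outside the normal control flow, so
 * the compiler must reload it on every check instead of caching it. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int sig)
{
    (void)sig;
    stop_requested = 1;
}

/* Hypothetical helpers for illustration. */
void install_sigint_handler(void) { signal(SIGINT, on_sigint); }
int should_stop(void) { return stop_requested != 0; }
```

An encode loop would call should_stop() once per frame and break out cleanly when it returns nonzero.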

Add High 10 Intra profile support (AVC-Intra)
x264 should now be able to encode compliant AVC-Intra 50.
With a 10-bit-compiled version of x264, a sample commandline for 1080i25 might be:
--interlaced --keyint 1 --vbv-bufsize 2000 --bitrate 50000 --vbv-maxrate 50000 --nal-hrd cbr

Also print "Constrained Baseline" for baseline profile, since that's all x264 (and everything else in the world) supports.
Also reorganize parameter validation a bit to reduce some spurious warnings.

Finish support for high-depth video throughout x264
Add support for high depth input in libx264.
Add support for 16-bit colorspaces in the filtering system.
Add support for input bit depths in the interval [9,16] with the raw demuxer.
Add a depth filter to dither input to x264.

Chroma mode decision/subpel for B-frames
Improves compression ~0.4-1%. Helps more on videos with lots of chroma detail.
Enabled at subme 9 (preset slower) and higher.

Various cosmetics

Make slice-max-size more aggressive in considering escape bytes
The x264 assumption of randomly distributed escape bytes fails in the case of CABAC + an enormous number of identical macroblocks.
This patch attempts to compensate for this.
It is probably safe to assume in calling applications that x264 practically never violates the slice size limitation.

Add missing emms for dump-yuv

Fix CFR ratecontrol with timebase != 1/fps
Fixes VBV + DTS compression, among other things.

Fix DTS/bitrate calculation if the first PTS wasn't zero
Fix bitrate calculation with DTS compression.

Fix regression in r1716

Cosmetics in me.c and frame.c

Add support for arbitrary user SEIs
This allows calling applications to insert SEIs that x264 doesn't know about while maintaining HRD/VBV accuracy.

Add full chroma input flag to swscale
Improves quality of colorspace conversions involving RGB(A).

Add --disable-gpl option to configure
Used for commercially-licensed versions of x264.
Doesn't currently change anything, but may be used to disable GPL-only CLI tools, such as video filters, in the future.
Also print the x264 license and libavformat license in version info.

Update source file headers
Update dates, improve file descriptions, make things more consistent.
Also add information about commercial licensing.

Fix intra refresh to not exceed max recovery_frame_cnt
The spec constrains recovery_frame_cnt to [0, MaxFrameNum-1].
So make MaxFrameNum bigger in the case of intra refresh.

Make intra refresh finish one frame faster
In some cases, the last frame of intra refresh was redundant.
Saves a few bits.

Fix intra refresh to not predict from invalid pixels
The blocks on the right side of the intra refresh column should not predict from top-right.

Add configure check for mingw64 prefixing
This compensates for the inconsistent prefixing seen in different versions of the compiler.

Update some Altivec function prototypes
Silences a lot of warnings.

Add support for level 1b
This level is a stupid hack in the H.264 spec, so it's a stupid hack in x264 too.
Since level is an integer, calling applications need to set level_idc=9 to use it.
String-based option handling will accept "1b" just fine though, so CLI users don't have to worry.
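
A sketch of how string-based handling could map "1b" onto the integer field. The helper name level_idc_from_string is hypothetical and this is not x264's actual parser; the only facts taken from the note above are that "1b" is accepted as a string and corresponds to level_idc=9.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map a level string to an integer level_idc.
 * "1b" has no decimal form, so it gets the special value 9; other levels
 * are the level number times ten ("4.1" -> 41). */
int level_idc_from_string(const char *s)
{
    if (strcmp(s, "1b") == 0)
        return 9;
    return (int)(10.0 * atof(s) + 0.5);
}
```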

Use smaller values for idr_pic_id
Saves a few bits and fixes problems on certain fantastically terrible decoders,
such as the Apple iPad.

Use POC type 2 for streams with no B-frames
Saves a few bits per slice header.

Faster cabac_encode_ue_bypass
Use CLZ + a lut instead of a loop.
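
To illustrate the CLZ half of this idea (the lookup table is not shown), the sketch below computes the bit length of an order-0 Exp-Golomb code both ways. It is not x264's cabac code: the function names are hypothetical, __builtin_clz is a GCC/Clang builtin, and v is assumed to be less than UINT32_MAX so that v + 1 is nonzero.

```c
#include <stdint.h>

/* Loop version: count the bits of v+1 one shift at a time. */
int ue_size_loop(uint32_t v)
{
    int bits = 0;
    for (uint32_t t = v + 1; t > 1; t >>= 1)
        bits++;
    return 2 * bits + 1;
}

/* CLZ version: floor(log2(v+1)) is 31 minus the number of leading zeros. */
int ue_size_clz(uint32_t v)
{
    int bits = 31 - __builtin_clz(v + 1);
    return 2 * bits + 1;
}
```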

Faster nal_escape asm

Allow --demuxer forcing with known extensions

Minor fixes/cosmetics in commandline parsing

Fix overflow in stats printing

Fix bug in 2pass if the first P-frames are all skip
last_qscale_for was read before being initialized in this case, resulting
in the value from the previous iteration being used instead.

Don't do deblock-aware RD if deblocking is off

CAVLC "trellis"
~3-10% improved compression with CAVLC.
--trellis is now a valid option with CAVLC.
Perhaps more importantly, this means psy-trellis now works with CAVLC.

This isn't a real trellis; it's actually just a simplified QNS.
But it takes enough shortcuts that it's still roughly as fast as a trellis; just not quite optimal.
Thus the name is a bit of a misnomer, but we're reusing the option name because it does the same thing.
A real trellis would be better, but CAVLC is much harder to trellis than CABAC.
I'm not aware of any published polynomial-time solutions that are significantly close to optimal.

Add global #define for maximum reference count
This should make it easier to play around with reference frame counts that exceed the spec maximum.

Simplify addressing logic for interlaced-related arrays
In progressive mode, just make [0] and [1] point to the same place.

Add missing emms to x264_nal_encode
Only matters for applications using the low-latency callback feature.

Fix 2 bugs with slice-max-size
Macroblock re-encoding didn't restore mv/tex bit counters (slightly inaccurate 2-pass).
Bitstream buffer check didn't work correctly (insanely large frames could break encoding).

NV12 version of Altivec chroma MC

Deblock-aware RD
Small quality gain (~0.5%) at lower bitrates, potentially larger with QPRD.
May help more with psy, maybe not.
Enabled at subme >= 9. Small speed cost (a few %).

Correct X header path usage in configure
Don't unconditionally set the header path for OpenBSD but do so if the
--enable-visualize flag is specified.

Fix lavf input with delayed frames

Slightly improve the filtering section of x264 --help

Fix debug message typo with DTS compression

Try to guess input length for lavf input
Allows printing of progress indicator when using lavf input.

Workaround bug in fps/timestamp handling with lavf input
reordered_opaque in lavf doesn't work correctly in the identity case (no reordering).
Fixes incorrect output for some file types (e.g. raw in mov).

Fix aspect ratio writing in the MKV muxer
The braindead Matroska spec dictates aspect ratio to be measured in pixels instead of, well, an actual aspect ratio.
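
In practice this means a muxer has to turn a sample aspect ratio into display dimensions. A minimal sketch, assuming positive SAR terms; the helper name display_size and its rounding choice are illustrative and not taken from x264's muxer.

```c
#include <stdint.h>

/* Hypothetical helper: convert a coded frame size plus a sample aspect ratio
 * (sar_w:sar_h) into display dimensions in pixels. The larger SAR term
 * stretches its own axis, so neither dimension shrinks. */
void display_size(int width, int height, int sar_w, int sar_h,
                  int *disp_w, int *disp_h)
{
    *disp_w = width;
    *disp_h = height;
    if (sar_w > sar_h)
        *disp_w = (int)(((int64_t)width * sar_w + sar_h / 2) / sar_h);
    else if (sar_h > sar_w)
        *disp_h = (int)(((int64_t)height * sar_h + sar_w / 2) / sar_w);
}
```

For example, 1440x1080 content with a 4:3 sample aspect ratio would be signalled as 1920x1080 display pixels.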

Add libavcore check in configure

Improve quantizer distribution with sliced-threads+VBV
Should help avoid cases of very uneven quantizer choice between slices.

Remove dead code in slicetype.c

Fix incorrect duration/framerate/bitrate in flv header

invalidate_reference fixes
invalidate_reference didn't actually invalidate the immediate previous frame, only frames that came before that.
Make sure that reordering is forced when invalidate_reference is used, so that the reference list is correct decoder-side.

Filtering system-related fixes
Fix configure to check for outdated libavutil in resize filter support.
Do not print an explicit error message in ffms when requesting a frame beyond the number of frames in the source.
Mention in --*help that filtering options can be specified as name=value.
Fix the shadowing warning in the resize filter on posix systems.

Improve reference_invalid support
Reference invalidation can now be used to invalidate multiple frames at a time, rather than being limited to one per encoder_encode call.

Eradicate all mention of SI/SP-frames

Fix stack alignment with MB-tree
Broke 2-pass with MB-tree when calling from compilers with broken stack alignment (e.g. MSVC).

Avisynth 2.6 colorspace support
Use a customized avisynth_c.h to detect the new planar colorspaces.

Prevent some cases of cache aliasing.
Avoid cases where image strides were a large power of 2.
Core 2: +3% speed at widths 898..960, +6% at widths 1922..1984, most other resolutions unaffected.
Nehalem and AMD: similar amount of speedup, but fewer resolutions affected.
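
One way to avoid such strides is sketched below: round the stride up to the required alignment, then add one more alignment step if the result lands on a large power of 2. The function name and the 64/1024 values used in the note after the code are assumptions for illustration, not confirmed x264 constants.

```c
/* Round x up to a multiple of align (a power of two); if the result is also a
 * multiple of disalign (a larger power of two), add one more align step so
 * that consecutive rows do not map to the same cache sets. */
int padded_stride(int x, int align, int disalign)
{
    x = (x + align - 1) & ~(align - 1);
    if ((x & (disalign - 1)) == 0)
        x += align;
    return x;
}
```

With align = 64 and disalign = 1024, a padded width of 1024 becomes a stride of 1088, while 960 is left alone.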

Fix stack alignment for adaptive quant
Broke calls from compilers with broken stack alignment (e.g. MSVC).

Fix compilation with shared ffmpeg libs
lavf input uses libavutil functions, so it must request flags for libavutil from pkg-config.

Fix another PCM bug
CABAC assumes that NNZ is 0 or 1, not the number of actual nonzero coefficients.
Didn't actually break the output; only had a tiny effect on RD.

Fix regression in r1666
Broke encoding of PCM macroblocks.

Fix build with bit_depth > 8
Definition of x264_cli_plane_copy was inconsistent with declaration.

Convert x264 to use NV12 pixel format internally
~1% faster overall on Conroe, mostly due to improved cache locality.
Also allows improved SIMD on some chroma functions (e.g. deblock).
This change also extends the API to allow direct NV12 input, which should be a bit faster than YV12.
This isn't currently used in the x264cli, as swscale does not have fast NV12 conversion routines, but it might be useful for other applications.

Note this patch disables the chroma SIMD code for PPC and ARM until new versions are written.

Add video filtering system to x264cli
Similar to mplayer's -vf system.
Supports some basic operations like resizing and cropping. Will support more in the future.
See the help for more details.

Eliminate edge cases for MV predictors
Saves a few clocks in mv pred.

Improve scenecut detection a bit
Put a minimum value on the scenecut threshold; makes x264 more likely to catch successive scenecuts (but might increase the odds of false detection).
This also fixes scenecut detection with keyint=infinite.
Also print keyint=infinite in the x264 SEI and statsfile correctly.

Fix 8x8dct+slices+no sliced threads+cavlc+deblock
Deblocking was done slightly incorrectly.
Regression in r1612.

Fix off-by-one error in slice VBV predictor updates

Fix disabling of progress with --log-level

Support for 9 and 10-bit encoding
Output bit depth is specified at compile time via --bit-depth.
There is currently almost no assembly code available for high-bit-depth modes, so encoding will be very slow.
Input is still 8-bit only; this will change in the future.

Note that very few H.264 decoders support >8 bit depth currently.
Also note that the quantizer scale differs for higher bit depth. For example, for 10-bit, the quantizer (and crf) ranges from 0 to 63 instead of 0 to 51.
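
The 0 to 63 range for 10-bit follows from the H.264 rule that each extra bit of depth extends the quantizer range by 6. A small sketch, with a hypothetical function name:

```c
/* Largest quantizer for a given bit depth: 51 at 8-bit, plus 6 per extra bit. */
int qp_max_for_bit_depth(int bit_depth)
{
    return 51 + 6 * (bit_depth - 8);
}
```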

Support infinite keyint (--keyint infinite).
This just means x264 won't insert non-scenecut keyframes.
Useful for streaming when using interactive error recovery or some other mechanism that makes keyframes unnecessary.

Also change POC logic to limit POC/framenum LSB size (to save bits per slice).
Also fix a bug in the CPB underflow detection code (didn't affect the bitstream, just resulted in the failure to print certain warning messages).

Don't check i16x16 planar mode unless previous modes were useful
Saves ~160 clocks per MB at subme=1, ~270 per MB at subme>1 (measured on Core i7).
Negligible effect on compression.

Also make a few more arrays static.

Centralize logging within x264cli
x264cli messages will now respect the log level they pertain to.
Slightly reduces binary size.

Make open-GOP Blu-ray compatible
Blu-ray is even more braindamaged than we thought.
Accordingly, open-gop options are now "normal" and "bluray", as opposed to display and coded.
Normal should be used in all cases besides Blu-ray authoring.

Callback feature for low-latency per-slice output
Add a callback to allow the calling application to send slices immediately after being encoded.
Also add some extra information to the x264_nal_t structure to help inform such a calling application how the NAL units should be ordered.

Full documentation is in x264.h.

Simplify pixel_ads

Interactive encoder control: error resilience
In low-latency streaming with few clients, it is often feasible to modify encoder behavior in some fashion based on feedback from clients.
One possible application of this is error resilience: if a packet is lost, mark the associated frame (and any referenced from it) as lost.
This allows quick recovery from errors with minimal expense bit-wise.

The new i_dpb_size parameter allows a calling application to tell x264 to use a larger DPB size than required by the number of reference frames.
This lets x264 and the client keep a large buffer of old references to fall back to in case of lost frames.
If no recovery is possible even with the available buffer, x264 will force a keyframe.

This initial version does not support B-frames or intra refresh.
Recommended usage is to set keyint to a very large value, so that keyframes do not occur except as necessary for extreme error recovery.

Full documentation is in x264.h.

Move DTS/PTS calculation to before encoding each frame instead of after.
Improve documentation of x264_encoder_intra_refresh.

Lookaheadless MB-tree support
Uses past motion information instead of future data from the lookahead.
Not as accurate, but better than nothing in zero-latency compression when a lookahead isn't available.
Currently resets on keyframes, so only available if intra-refresh is set, to avoid pops on non-scenecut keyframes.
Not on by default with any preset/tune combination; must be enabled explicitly if --tune zerolatency is used.

Also slightly modify encoding presets: disable rc-lookahead in the fastest presets.
Enable MB-tree in "veryfast", albeit with a very short lookahead.

Open-GOP support
Allows B-frames immediately prior to keyframes (in display order).
This helps reduce keyframe popping and improve compression with short keyframe intervals.
Due to a staggering display of braindamage in the Blu-ray spec, two open-GOP modes are available.
The two modes calculate keyframe interval differently: one based on coded distance and one based on display distance.
The latter is superior compression-wise, but for no comprehensible reason, Blu-ray requires the former if open-GOP is used.

Use threadpools to avoid unnecessary thread creation
Tiny performance improvement with fast settings and lots of threads.
May help more on some OSs with slow thread creation, like OS X.
Unify inconsistent synchronized abbreviations to sync.

Improve 2-pass bitrate prediction
Adapt based on distance to the end in bits, not in frames.
Helps in videos with absurdly simple end sections, e.g. black frames.

SSE4 and SSSE3 versions of some intra_sad functions
Primarily Nehalem-optimized.

Improve HRD accuracy
In a staggering display of brain damage, the spec requires all HRD math to be done in infinite precision despite the output being of quite limited precision.
Accordingly, convert buffer management to work in units of timescale.
These accumulating rounding errors probably didn't cause any real problems, but might in theory cause issues in very picky muxers on extremely long-running streams.

Use -fno-tree-vectorize to avoid miscompilation
Some versions of gcc have been reported to attempt (and fail) to vectorize a loop in plane_expand_border.
This results in a segfault, so to limit the possible effects of gcc's utter incompetence, we're turning off vectorization entirely.
It's not like it ever did anything useful to begin with.

Fix SIGPIPEs caused by is_regular_file checks
Check to see if input file is a pipe without opening it.

Fix compilation on ARM w/ Apple ABI

Faster mbtree_propagate asm
Replace fp division by multiply with the reciprocal.
Only ~12% faster on penryn, but over 80% faster on amd k8.
Also make checkasm slightly more tolerant to rounding error.
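
The simplest form of this idea is shown below in plain C with a constant divisor. This is only an illustration with hypothetical function names: the real change is in SIMD assembly, where the divisor varies per element and the reciprocal is typically an approximation, which is presumably why checkasm needed more tolerance.

```c
/* Division inside the loop: one floating-point divide per element. */
void scale_div(float *dst, const float *src, int n, float divisor)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] / divisor;
}

/* Reciprocal computed once: a single divide, then only multiplies.
 * Results can differ from true division in the last bit. */
void scale_mul(float *dst, const float *src, int n, float divisor)
{
    float inv = 1.0f / divisor;
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * inv;
}
```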

Convert the OPT_ defines in x264.c to an enum

Don't allow baseline profile streams with fake-interlaced
Indicate use of --fake-interlaced in encoding options SEI.

Allocate space for null terminator in param_apply_tune

Fix regression in r1501.
Could cause slightly incorrect analysis in rare cases, but no serious encoding issues.
Also shut up gcc warning about pels_v.

Fix crash with --subme 0 + --weightp > 0. Regression in r1535

Replace some divisions with shifts

Warn about shadowed variable declarations
Also get rid of a few instances of variable shadowing.

Template load_pic_pointers based on interlaced
Significantly speeds up cache_load in the non-interlaced case.
Also various other minor optimizations in cache_load and cache_save.

Remove double-dereferences for MB width/height data
Store it in x264_t instead of going through the SPS.

Exempt Win x86_64 from memalign hack
The API mandates all mallocs are 16 byte aligned.
Remove unused int that stores sizeof malloc in memalign hack.
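
For context, a "memalign hack" generally means getting aligned memory out of a plain malloc. The sketch below shows the common form of the technique under that assumption; the function names are hypothetical and this is not x264's x264_malloc.

```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, round the address up to a 16-byte boundary, and stash the
 * original malloc pointer just below the aligned block for the later free. */
void *aligned_malloc16(size_t size)
{
    uint8_t *raw = malloc(size + 15 + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t addr = ((uintptr_t)raw + sizeof(void *) + 15) & ~(uintptr_t)15;
    ((void **)addr)[-1] = raw;
    return (void *)addr;
}

/* Recover the original pointer stored below the block and free it. */
void aligned_free16(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```

On a platform whose malloc already guarantees 16-byte alignment, the extra bytes and the stored pointer are pure overhead, which is the motivation for exempting it.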

Preprocessing cosmetics
Unify input/output defines to HAVE_* format.
Define values as 1 to simplify conditionals.

Take more shortcuts in i4x4/i8x8 analysis
Based on the scores of the H and V modes, rule out modes which are unlikely.
Small compression loss (0.1-0.5%) and large speed gain (10-30% faster intra analysis).
Not enabled in slower encoding modes.

Also make C versions of the merged SATD functions in order to eliminate branches based on their availability.

Display SSIM measurement in db as well

indicate "M" for local commits too

Add error message for invalid [de]muxer selection

Deduplicate the ALIGN macro, move it to common.h

Fix a use of ALIGNED_ARRAY_16 on ARM

Add missing emms after nal_encode
Caused random, bizarre failures with some calling applications.

Fix crash in fake-interlaced at some resolutions

Fix no-mbtree + aq-mode=0

Regression in r1618.

Add API function to fix x264_picture_t initialization
Calling applications that do not use x264_picture_alloc need to use x264_picture_init to initialize x264_picture_t structures.
Previously, if the calling application didn't zero x264_picture_t, Bad Things could happen.

Fix Avisynth input
Regression in r1624. A more permanent solution to the problem will be committed later.

Convert to a unified "dctcoeff" type for DCT data
Necessary for future high bit-depth support.

Convert to a unified "pixel" type for pixel data
Necessary for future high bit-depth support.
Various macros and extra types have been introduced to make operations on variable-size pixels more convenient.

Add API tool to apply arbitrary quantizer offsets
The calling application can now pass a "map" of quantizer offsets to apply to each frame.
An optional callback to free the map can also be included.
This allows all kinds of flexible region-of-interest coding and similar.
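
As a rough illustration of what such a map could look like, the sketch below builds one offset per 16x16 macroblock in row-major order and lowers the quantizer inside a rectangular region of interest. The layout, the function name roi_offset_map and the offset value are assumptions made for this example; the real structure and callback are documented in x264.h.

```python
def roi_offset_map(mb_width, mb_height, roi, offset):
    """Return a row-major list with one quantizer offset per macroblock.

    roi is (x0, y0, x1, y1) in macroblock units, end-exclusive.
    A negative offset lowers the quantizer (better quality) in the region.
    All names here are hypothetical, not part of the x264 API.
    """
    x0, y0, x1, y1 = roi
    offsets = [0.0] * (mb_width * mb_height)
    for y in range(max(y0, 0), min(y1, mb_height)):
        for x in range(max(x0, 0), min(x1, mb_width)):
            offsets[y * mb_width + x] = offset
    return offsets


# 1280x720 is 80x45 macroblocks; favor a 20x15 macroblock region.
qp_map = roi_offset_map(80, 45, (30, 15, 50, 30), -4.0)
```

A calling application would hand a map like this to the encoder with each frame and free it in the optional callback.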

x86 assembly code for NAL escaping
Up to ~10x faster than C depending on CPU.
Helps the most at very high bitrates (e.g. lossless).
Also make the C code faster and simpler.
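
For readers unfamiliar with NAL escaping, here is a minimal Python sketch of the operation as the H.264 byte stream rules define it: whenever two zero bytes are followed by a byte of value 3 or less, an emulation prevention byte (0x03) is inserted. This is only an illustration of the technique, not the C or assembly code from this commit.

```python
def nal_escape(payload: bytes) -> bytes:
    """Insert emulation prevention bytes so no startcode appears in the payload."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in payload:
        if zeros >= 2 and b <= 3:
            out.append(3)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


escaped = nal_escape(b"\x00\x00\x01\x42")
```

Long runs of zeros are common in lossless and very high bitrate streams, which is consistent with the note above about where the speedup matters most.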

Re-enable i8x8 merged SATD
Accidentally got disabled when intra_sad_x3 was added.

Some deblocking-related optimizations

Optimize out some x264_scan8 reads

Add fast skip in lookahead motion search
Helps speed very significantly on motionless blocks.

Merge some of adaptive quant and weightp
Eliminate redundant work; both of them were calculating variance of the frame.

Fix omission in libx264 tuning documentation

Fix ultrafast to actually turn off weightb

Fix crash with MP4-muxing if zero frames were encoded

Fix cavlc+deblock+8x8dct (regression in r1612)
Add cavlc+8x8dct munging to new deblock system.
May have caused minor visual artifacts.

Fix 10L in r1612
Stats need to be calculated before deblock strength, not after.
Broke ref stats in x264cli (no effect on actual output).

Overhaul deblocking again
Move deblock strength calculation to immediately after encoding to take advantage of the data that's already in cache.
Keep the deblocking itself as per-row.

Detect Atom CPU, enable appropriate asm functions
I'm not going to actually optimize for this pile of garbage unless someone pays me.
But it can't hurt to at least enable the correct functions based on benchmarks.

Also save some cache on Intel CPUs that don't need the decimate LUT due to having fast bsr/bsf.

Slightly faster mbtree asm

Faster deblock strength asm on conroe/penryn

Avoid an extra var2 in chroma encoding if possible
Also remove a redundant if.

Avoid a redundant qpel check in lookahead with subme <= 1.

Fix ABR rate control calculations
Incorrect frame numbers were used, resulting in slightly inaccurate ratecontrol.

Fix calculation of total bitrate printed after stop by CTRL+C

Fix typo in fake-interlaced documentation

Fix CABAC+PCM, regression in r1592
Changes to queue in CABAC didn't get propagated to PCM code.

Fix performance regression in r1582
Set the correct compiler flags.

Rewrite deblock strength calculation, add asm
Rewrite is significantly slower, but is necessary to make asm possible.
Similar concept to ffmpeg's deblock strength asm.
Roughly one order of magnitude faster than C.
Overall, with the asm, saves ~100-300 clocks in deblocking per MB.

Fix different output with differing sync-lookahead
Also reduce memory consumption.

Mark Win32 executable as large address aware

Add "Fake interlaced" option
This encodes all frames progressively yet flags the stream as interlaced.
This makes it possible to encode valid 25p and 30p Blu-Ray streams.
Also put the pulldown help section in a more appropriate place.

Modify to output to stdout.
Update configure to match.

Set correct filesystem permissions for various files

Fix regression in r1566
Intra stats need to be kept track of for fast intra decision.

Fix rc-lookahead in encoding options SEI in 2-pass with VBV

Reduce memory usage in 2-pass with b-adapt 2

Overhaul CABAC: faster, less cache usage
Horribly munge up the CABAC tables to allow deduplication of some data.
Saves 256 bytes of L1d cache in non-RD, 512 bytes in RD.
Add asm versions of bypass and terminal; save L1i cache by re-using putbyte code.
Further optimize encode_decision.
All 3 primary CABAC functions fit in under 256 bytes of code total on x86_64.

Fix typo in pulldown

Fix bitrate calculation in progress status
Was slightly incorrect due to using pts, which is out of order.

Fix crash with sliced-threads on Phenom

Fix condition for printing rc=cbr in options SEI
Also fix crf-max formatting.

Shrink even more constant arrays

Add API function to trigger intra refresh
Useful for interactive applications where the encoder knows that packet loss has occurred on the client.
Full documentation is in x264.h.

Fix intra refresh behavior with I-frames
Intra refresh still allows I-frames (for scenecuts/etc).
Now I-frames count as a full refresh, as opposed to instantly triggering a refresh.

More cosmetics

Fix unresolved symbol in r1573
gnu ld didn't complain, but some other linkers did.

Remove unnecessary --enable options
Change --enable-visualize to actually check for X11 support.

Don't force row QPs to integer values with VBV
VBV should no longer raise the bitrate of the video. That is, at a given quality level or average bitrate, turning on VBV should only lower the bitrate.
This isn't quite true if adaptive quant is off, but nobody should be doing that anyways.
Also may result in slightly more accurate per-row VBV ratecontrol.

Add field-order detection to y4m demuxer

Fix sliced-threads + interlaced
Broken in r1546.

Improve temporal MV prediction
Predict based on the results of p16x16 search, not final MVs.
This lets us get predictions even if mode decision chose intra.
Also improves cache coherency.

More accurate MV prediction on edges in lookahead

Error out on invalid input stride
Might catch some crashes due to buggy calling applications.

Remove unnecessary debugging assert
Shouldn't have been in r1568 to begin with.

Shrink some more constant arrays

Deduplicate asm constants, automate name prefixing
Auto-prefix global constants with x264_ in cextern.
Eliminate x264_ prefix from asm files; automate it in cglobal.
Deduplicate asm constants wherever possible to save data cache (move them to a new const-a.asm).
Remove x264_emms() entirely on non-x86 (don't even call an empty function).
Add cextern_naked for a non-prefixed cextern (used in checkasm).

Shrink a few x86 asm functions
Add a few more instructions to cut down on the use of the 4-byte addressing mode.

Make options SEI use weight* instead of wpred*
More intuitive and maps more reasonably to the CLI options.
Breaks statsfile backwards-compatibility.

r1548 broke subme < 3 + p8x8/b8x8
Caused significantly worse compression. Preset-wise, only affected veryfast.
Fixed by not modifying mvc in-place.

More write-combining

Reduce lookahead memory usage, cache misses
Merge lowres_types with lowres_costs.

Fix build on x86 with asm on but SSE off

Don't calculate ref/partition stats if not necessary

Split out MV prediction into mvpred.c
Make common/macroblock.c a bit less gigantic.

Fix mv predictor clipping on non-x86 (regression in r1548)

Move getopt.c to x264cli sources from libx264
Only affects builds on systems without getopt.c.

Move deblocking code to a separate file
Should clean up frame.c a bit.

Fix ffms demuxer to support input timebase values > 2^31

Fix 10l in cache_load changes
Broke constrained intra pred, probably not anything else.

Faster fullpel predictor checking
Also shave a few instructions off dia/hex motion estimation loops.

Fix checkasm's generation of deblock inputs (regression in r1517)

Fix printing of bitrate when timestamps aren't available
Doesn't affect x264cli, but was broken in some other apps in CFR mode.

Don't check mv0 twice
One less SAD in motion estimation.
Also rename bmv -> pmv; more accurate naming.

Remove reordering restrictions from weightp
Apparently the spec does allow two consecutive copies of the same frame in the reference list.
This involves an incredibly ugly hack to wrap around the frame number.
Very slight compression improvement.

Print intra chroma pred modes in stats

Add mv0 special case in pskip chroma MC
Significantly faster pskip MC.

Fix build scripts to work with non-GNU tools

Faster deblock reference frame checks
Use a lookup table to simplify logic

Faster chroma CBP handling

Fix issues with extremely large timebases
With timebase denominators >= 2^30, x264 would silently overflow and cause odd issues.
Now x264 will explicitly fail with timebase denominators >= 2^31 and work with timebase denominators 2^31 > x >= 2^30.

MMX code for predictor rounding/clipping
Faster predictor checking at subme < 3.

Fix four minor bugs found by Clang

Move deblocking/hpel into sliced threads
Instead of doing both as a separate pass, do them during the main encode.
This requires disabling deblocking between slices (disable_deblock_idc == 2).
Overall performance gain is about 11% on --preset superfast with sliced threads.
Doesn't reduce the amount of actual computation done: only better parallelizes it.

Prefetch MB data in cache_load
Dramatically reduces L1 cache misses.
~10% faster cache_load.

Fix a ton of pessimization caused by aliasing in cache_save and cache_load

Add CP128/M128 macros using SSE

Fix various early terminations with slices
Neighbouring type values (type_top, etc) are now loaded even if the MB isn't available for prediction.
Significant overall performance increase (as high as 5-10%+) with lots of slices (e.g. with slice-max-size).

Enable --fast-pskip on fast firstpass

Make interlaced detection in avisynth only apply to field-based input
Fixes improper flagging of progressive sources.

Set psy=0 in lossless mode
Doesn't actually affect output, just what's written in the SEI.

Fix a use of sad_x4 that had non-mod64 stride
Minimal speed improvement, but fixes a violation of internal api.

Make keyint_min auto by default
Gives more reasonable default settings when using short GOPs.

Faster mv predictor checking at subme < 3
Simplify the predicted MV cost check.

Special case in qpel refine for subme=1
~15-20% faster qpel refine with subme=1.
Some minor cleanups in refine_supel.

Cosmetics: VLC tables

Add faster mv0 special case for macroblock-tree
Improves performance on low-motion video.

Add miscompilation check for x264_clz
Running a Phenom-optimized build of x264 (e.g. -march=amdfam10) on a non-Phenom CPU didn't SIGILL; instead it would silently produce incorrect output.
Now, instead, it will error out loudly.

Fix floating-point exception in level-checking
Doesn't cause any issues for x264cli, but might impact some calling apps that care (e.g. Delphi apps).

Save a few bits in multislice encoding
Set the initial QP for each slice to the last QP of the previous slice.

Early termination in 16x8/8x16 search
Combine the actual cost of the first partition with the predicted cost of the second to avoid searching the second when possible.
Reduces the number of times the second partition is searched by up to ~75% in non-RD mode, ~10% in RD mode.
Negligible effect on compression.
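
The idea can be sketched as follows. How x264 predicts the second partition's cost and what it compares the sum against are not stated here, so the comparison against the best cost found so far and all function names are assumptions made for illustration.

```python
def try_split_mode(search_first, predict_second, search_second, best_cost):
    """Return the total cost of a two-partition mode, or None if it cannot win.

    search_first / search_second run the (expensive) motion search and return
    a cost; predict_second returns a cheap estimate. All are hypothetical.
    """
    first = search_first()
    # Early out: the real first half plus the estimate is already too expensive.
    if first + predict_second() >= best_cost:
        return None
    total = first + search_second()
    return total if total < best_cost else None


calls = []

def expensive_second():
    calls.append("second")
    return 400

skipped = try_split_mode(lambda: 700, lambda: 350, expensive_second, best_cost=1000)
searched = try_split_mode(lambda: 300, lambda: 350, expensive_second, best_cost=1000)
```

In the first call the second partition is never searched; in the second call it is searched once.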

Make MV prediction work across slice boundaries
Should improve motion search with lots of small slices, e.g. with slice-max-size.
Still restricted by sliced threads (won't cross the boundary between two threadslices).
The output-changing part of the previous patch.

Cleanup and simplification of macroblock_load
Doesn't do anything now, but will be useful for many future changes.
Splitting out neighbour calculation will make MBAFF implementation easier.
Calculation of neighbour_frame value (actual neighbouring MBs, ignoring slices) will be useful for some future patches.

Add missing #include to display-x11.c

Add TFF/BFF detection to all demuxers
Fix interlaced Avisynth input, automatically weave field-based input.

Correctly mark output frames as BREF
Simplify pic_out code.

Fix HRD compliance
As usual, the spec is so insanely obfuscated that it's impossible to get things right the first time.

Better b16x8/8x16 early termination in B-frames
A bit slower but up to 1-2% better compression.

Fix 10L in B-skip improvement patch

Fix printing of SEI header with VBV + ABR
SEI header shouldn't say CBR unless bitrate == maxrate.

Simplify slicetype_frame_cost
Avoid redundant calculations when VBV is on (due to the intra-only call).
Move most of the logic into per-MB code.

Faster CABAC state copying for small partitions
Save ~25 clocks per i4x4, i8x8, and sub8x8 RD call.

Massive cosmetic and syntax cleanup
Convert all applicable loops to use C99 loop index syntax.
Clean up most inconsistent syntax in ratecontrol.c, visualize, ppc, etc.
Replace log(x)/log(2) constructs with log2, and similar with log10.
Fix all -Wshadow violations.
Fix visualize support.

Fix array overread in b8x16 search

Faster direct check with subpartitions off
Also simplify the whole function a bit.

Print crf-max with appropriate precision in SEI

Fix 10l in timecode seeking

Fix 10L: Remove needless error check
This error check was for cfr input + --timebase, but that doesn't happen, and brings about a bug with vfr input.

Don't use 2 L1 refs with pyramid + ref=1
Slightly faster encoding with ref=1.

Update copyright year in SEI header

New "superfast" preset, much faster intra analysis

Especially at the fastest settings, intra analysis was taking up the majority of MB analysis time.
This patch takes a ton more shortcuts at the fastest encoding settings, decreasing compression 0.5-5% but improving speed greatly.
Also rearrange the fastest presets a bit: now we have ultrafast, superfast, veryfast, faster.
superfast is the old veryfast (but much faster due to this patch).
veryfast is between the old veryfast and faster.
faster is the same as before except with MB-tree on.

Encoding with subme >= 5 should be unaffected by this patch.

Avoid redundant MV prediction in duplicate refs

Cosmetics in mvd handling
Use a 2D array instead of doing manual pointer arithmetic.

Fix make uninstall on systems with executable suffixes

Add tune for still image compression
There has been some demand for this from companies looking to use x264 for still image compression (it can outperform JPEG or JPEG-2000 by a factor of 2 or more).
Still image compression is a bit different; because temporal stability isn't an issue, we can get away with far more powerful psy settings.

Pad non-mod16 resolutions using the correct field

Improves compression of interlaced videos with non-mod16 heights.

Document slow/fast firstpass in --fullhelp

Fix some misattributions in profiling
Cycles spent in load_hadamard and the avg2 w16 ssse3 cacheline split code were misattributed.

Much faster non-RD intra analysis
Since every pred mode costs at least 1 bit, move that part into the initial SATD cost.
This lets i4x4/i8x8 analysis terminate earlier.
If the cost of the predicted mode is less than the cost of signalling any other mode, early-terminate the analysis.
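
A minimal sketch of that early termination, assuming the predicted mode costs 1 bit to signal and any other mode costs 4 bits (as in CAVLC i4x4 signalling), and that cost is SATD plus lambda times bits. The function and constant names are made up for this example.

```python
PRED_MODE_BITS = 1   # assumed cost of signalling the predicted mode
OTHER_MODE_BITS = 4  # assumed cost of signalling any other mode


def pick_intra_mode(satd, predicted, lam):
    """satd maps mode name -> SATD score; returns (mode, cost, modes_checked)."""
    best_mode = predicted
    best_cost = satd[predicted] + lam * PRED_MODE_BITS
    checked = 1
    # No other mode can cost less than its signalling bits alone,
    # so if the predicted mode already beats that, stop here.
    if best_cost < lam * OTHER_MODE_BITS:
        return best_mode, best_cost, checked
    for mode, score in satd.items():
        if mode == predicted:
            continue
        checked += 1
        cost = score + lam * OTHER_MODE_BITS
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost, checked


early = pick_intra_mode({"V": 10, "H": 0, "DC": 50}, "V", lam=5)
full = pick_intra_mode({"V": 100, "H": 20, "DC": 50}, "V", lam=5)
```

In the first call only the predicted mode is evaluated, even though another mode has a lower raw SATD, because its signalling cost alone exceeds the predicted mode's total.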

Fix stack alignment in sliced threads
Could cause crashes when called from non-GCC-compiled applications.

Cosmetics: use sizeof() where appropriate

Split up analyse_init
Save some time by avoiding some unnecessary inits and moving other parts to per-thread init.

Reduce stack usage of b-adapt 2's trellis
Also remove some redundant code.

Various motion estimation optimizations
Faster method of checking MV range.
Predict MVs and cache MVs/MVDs for bidir qpel-RD.
A whole bunch of other minor optimizations.
Slightly better performance and compression.

Overhaul macroblock_cache_rect
Unify the rectangle functions into a single one similar to ffmpeg's fill_rectangle.
Remove all cases of variable-size cache_rect calls; create a function-pointer-based system for handling such cases.
Should greatly decrease code size required for such calls.

Make a bunch of small functions ALWAYS_INLINE
Probably no real effect for now, but needed for the next patch.

Two compatibility fixes
Add IA64 support in configure.

Faster x264_macroblock_encode_pskip
GCC is apparently unable to optimize out the calculation of a variable when it isn't used.

Much more accurate B-skip detection at 2 < subme < 7
Use the same method that x264 uses for P-skip detection.
This significantly improves quality (1-6%), but at a significant speed cost as well (5-20%).
It also may have a very positive visual effect in cases where the inaccurate skip detection resulted in slightly-off vectors in B-frames.
This could cause slight blurring or non-smooth motion in low-complexity frames at high quantizers.
Not all instances of this problem are solved: the only universal solution is non-locally-optimal mode decision, which x264 does not currently have.

subme >= 7 or <= 2 are unaffected.

Reformat profile restrictions in --fullhelp.

Put "no interlaced", "no lossless" on their own line to avoid them
running into the default options list.

Fix typo in configure

Add support for spaces to iPhone GAS preprocessor script

Fix slightly wrong mp4 duration.

Fix link errors with newest gpac cvs
gpac decided to randomly break API and require us to use their own custom malloc and free.

Save a few bits in slice headers
Don't override the maximum ref index in the slice header if it's the same as the default.
Also update the naming of the relevant variables in the PPS.

Shrink some arrays in x264_t
Also remove an unnecessary assignment from cache_load.

Use x264_log in more places instead of fprintf

Fix two nondeterminisms
Move noise reduction data into thread-specific data.
Use correct reference list for L1 temporal predictors.

"CRF-max" support with VBV
This is a rather curious feature that may have more use than is initially obvious.
In CRF mode with VBV enabled, CRF-max allows the user to specify a quality level which the encoder will never go below, even due to the effects of VBV.
This is not the same as qpmax, which is not aware of issues like scene complexity.
Setting this WILL cause VBV underflows in any situation where the encoder would have needed to exceed the relevant CRF to avoid underflow.

Why might one want to do this even if it would cause VBV underflows?
In the case of streaming, particularly ultra-low-latency streaming, it may be preferable to drop frames than to display frames that are of too low a quality.
Thus, in extremely complex scenes, rather than display completely awful video, the streaming server could simply drop to a lower framerate.
Scenecuts, which normally look terrible under situations like single-frame VBV, could be handled by just displaying them a bit later and dropping frames to compensate.
In other words, it's better to see the scenecut 150ms delayed than for it to look like a blocky mess for 150ms.

On the caller-side, this would be handled by detecting the output size of x264's frames and dropping future frames to compensate if necessary.

This can also be used in normal encoding simply to ensure that VBV does not hurt quality too much (at the cost of potentially causing underflows).
This can help quite a lot when using single-frame VBV and sliced threads, where VBV can often be somewhat unstable.

Blu-ray support: NAL-HRD, VFR ratecontrol, filler, pulldown
x264 can now generate Blu-ray-compliant streams for authoring Blu-ray Discs!
Compliance tested using Sony BD-ROM Verifier 1.21.
Thanks to The Criterion Collection for sponsoring compliance testing!

An example command, using constant quality mode, for 1080p24 content:
x264 --crf 16 --preset veryslow --tune film --weightp 0 --bframes 3 --nal-hrd vbr --vbv-maxrate 40000 --vbv-bufsize 30000 --level 4.1 --keyint 24 --b-pyramid strict --slices 4 --aud --colorprim "bt709" --transfer "bt709" --colormatrix "bt709" --sar 1:1 <input> -o <output>

This command is much more complicated than usual due to the very complicated restrictions the Blu-ray spec has.
Most options after "tune" are required by the spec.
--weightp 0 is not, but there are known bugged Blu-ray player chipsets (Mediatek, notably) that will decode video with --weightp 1 or 2 incorrectly.
Furthermore, note the Blu-ray spec has very strict limitations on allowed resolution/fps combinations.
Examples include 1080p @ 24000/1001fps (NTSC FILM) and 720p @ 60000/1001fps.

Detailed features introduced in this patch:

Full NAL-HRD compliance, with both VBR (no filler) and CBR (filler) modes.
Can be enabled with --nal-hrd vbr/cbr.
libx264 now returns HRD timing information to the caller in the form of an x264_hrd_t.
x264cli doesn't currently use it, but this information is critical for compliant TS muxing.

Full VFR ratecontrol support: VBV, 1-pass ABR, and 2-pass modes.
This means that, even without knowing the average framerate, x264 can achieve a correct bitrate in target bitrate modes.
Note that this changes the statsfile format; first pass encodes made before this patch will have to be re-run.

Pulldown support: libx264 allows the calling application to specify a pulldown mode for each frame.
This is similar to the way that RFFs (Repeat Field Flags) work in MPEG-2.
Note that libx264 does not modify timestamps: it assumes the calling application has set timestamps correctly for pulldown!
x264cli contains an example implementation of caller-side pulldown code.
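
To illustrate what "setting timestamps correctly for pulldown" means on the caller side, the sketch below computes presentation times for a classic 3:2 cadence, where film frames are alternately shown for 3 and 2 fields at 60000/1001 fields per second. It is a hypothetical helper, not the x264cli implementation.

```python
from fractions import Fraction
from itertools import cycle, islice

FIELD_DURATION = Fraction(1001, 60000)  # seconds per field


def pulldown_32(num_frames):
    """Return (fields_per_frame, pts_in_seconds) for a 3:2 cadence."""
    fields = list(islice(cycle((3, 2)), num_frames))
    pts, elapsed_fields = [], 0
    for n in fields:
        pts.append(elapsed_fields * FIELD_DURATION)
        elapsed_fields += n
    return fields, pts


fields, pts = pulldown_32(5)
```

Four film frames cover ten fields, so the fifth frame starts at exactly four frame periods of 24000/1001 fps material, which is why the cadence keeps the average rate intact.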

Pic_struct support: necessary for pulldown and allows interlaced signalling.
Also signal TFF vs BFF with delta_poc_bottom: should significantly improve interlaced compression.
--tff and --bff should be preferred to the old --interlaced in order to tell x264 what field order to use.

Huge thanks to Alex Giladi and Lamont Alston for their work on code that eventually became part of this patch.

Timecode input/output
--tcfile-in allows a user to specify a timecode v1 or v2 file to override input timestamps.
Useful for dealing with VFR input, especially when FFMS/LAVF support isn't available.
--tcfile-out writes a timecode v2 file containing the timecodes of the output file.
New --timebase option allows a user to change the stream timebase.
Intended primarily for forcing timebase with timecode files if necessary.
When using --seek, note that x264 will seek in the timecode file as well.
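
For reference, a timecode v2 file is a text file with a "format v2" header line followed by one timestamp in milliseconds per frame. The sketch below parses such text and converts it to integer timestamps in a chosen timebase; the function name and the rounding choice are assumptions for this example, not x264's parser.

```python
from fractions import Fraction


def parse_timecode_v2(text, timebase=Fraction(1, 1000)):
    """Convert timecode v2 text (milliseconds per line) to ticks of `timebase`."""
    lines = text.splitlines()
    if not lines or "format v2" not in lines[0]:
        raise ValueError("not a timecode v2 file")
    ticks = []
    for line in lines[1:]:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        seconds = Fraction(line) / 1000
        ticks.append(round(seconds / timebase))
    return ticks


sample = "# timecode format v2\n0\n40\n80\n"
pts_25fps = parse_timecode_v2(sample, Fraction(1, 25))
pts_ms = parse_timecode_v2(sample)
```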

Mixed-refs support for B-frames
Small speed cost, usually a few percent at most. Generally has lowest cost in cases when it isn't very useful. Up to ~2% better compression overall on highly complex sources.

Also fix a few minor bugs in B-frame analysis and various bits of cleanup.

Faster rounding of chroma DC coefficients

Faster cabac_encode_decision_asm
Minimizes instruction count, which also means smaller code.
Various other slight changes to allow more instruction level parallelism.

Faster hpel_filter
On ssse3, use pmaddubsw for h filter too (similar to v filter).
Change 32-bit v and c filters to write the result non-temporal.
Add commented-out defines to disable non-temporal operation.
Hardly any black magic here, but still a measurable win especially for ssse3.

Ignore XYSCSS in y4m if the newer standard C tag is present

Apparently y4mscaler will generate 4:2:0 files with XYSCSS set to 444

Fix regression in r1450
I_PCM blocks would cause x264 to crash or generate bad output. Simplify PCM handling.

Fix crash with intra-refresh + aq-mode 0

Fix regression in r1453
r1453 broke psy-trellis with --trellis 2

Fix regression in r1449
Incorrectly placed thread MV check could result in rare thread MV internal errors, esp. with --non-deterministic.
These weren't fatal errors (x264 could recover and continue with slight compression loss).

Cut size of MVD arrays by a factor of 2 again
Only store the MVDs of the edges of each MB.

Thanks to Michael Niedermayer for the idea.

Disable Altivec and VIS optimizations when --disable-asm is specified

Fix a buffer overread on odd input resolutions

Fix one bug, one corner case in VBV
qp_novbv wasn't set correctly for B-frames.
Disable ABR code for frames with zero complexity.
Disable ABR code for CBR mode; it is completely unnecessary and can have negative consequences.

Port Mans Rullgard's NEON intra prediction functions from ffmpeg

Remove unused function
Two other minor fixes.

Use short startcode in more possible situations
Previous patch didn't cover all possible uses according to B.1.2.

Fix fastfirstpass
Apparently the libx264 preset changes made "fastfirstpass" into "fastsecondpass" inadvertently.

Fix various silly errors in the previous patches

Actually error out if preset/tune/profile is invalid
Got lost somewhere in the move to libx264-based presets.

Faster probe_skip, 2x2 DC transform handling
Move the 2x2 DC DCT into the dct_dc asm function to avoid some store-to-load forwarding penalties and extra register loads.
Use dct_dc as part of the early termination in probe_skip.
x86 asm partially by Holger Lubitz.
ARM NEON asm by David Conrad.

Use short startcodes whenever possible
Saves one byte per frame for every slice beyond the first.
Only applies to Annex-B output mode.
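
A small sketch of where the saving comes from, assuming the first slice of a frame keeps the 4-byte startcode and later slices use the 3-byte form. The real rules cover more NAL types (see Annex B of the spec); this example only models the slices of a single frame.

```python
LONG_STARTCODE = b"\x00\x00\x00\x01"
SHORT_STARTCODE = b"\x00\x00\x01"


def annexb_frame(slices, short_startcodes=True):
    """Concatenate the slice NAL units of one frame in Annex-B form."""
    out = bytearray()
    for i, nal in enumerate(slices):
        use_short = short_startcodes and i > 0
        out += SHORT_STARTCODE if use_short else LONG_STARTCODE
        out += nal
    return bytes(out)


slices = [b"\x65" + bytes([0xAA] * 9)] * 4  # four dummy 10-byte slices
saved = len(annexb_frame(slices, False)) - len(annexb_frame(slices, True))
```

With four slices, three of them use the short form, so three bytes are saved per frame.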

New algorithm for AQ mode 2
Combines the auto-ness of AQ2 with a new var^0.25 instead of log(var) formula.
Works better with MB-tree than the old AQ mode 2 and should give higher SSIM.
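
The sketch below shows the general shape of such a formula: each macroblock's variance is raised to the power 0.25, and the quantizer offset is the difference from the frame average (the "auto" part). The scaling, the strength parameter and the function name are assumptions for illustration; the actual constants live in the x264 source.

```python
def aq_offsets(variances, strength=1.0):
    """Per-macroblock quantizer offsets from var^0.25, centered on the frame mean."""
    energy = [v ** 0.25 for v in variances]
    average = sum(energy) / len(energy)
    return [strength * (e - average) for e in energy]


# Flat blocks (low variance) get a negative offset, busy blocks a positive one.
offsets = aq_offsets([1.0, 16.0, 81.0, 256.0])
```

Because the offsets are centered on the frame average, they redistribute bits within a frame rather than shifting the overall quantizer.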

Abide by the MinCR level limit
Some Blu-ray analyzers were complaining about this.

Make b-pyramid normal the default
Now that b-pyramid works with MB-tree and is spec compliant, there's no real reason not to make it default.
Improves compression 0-5% depending on the video.
Also allow 0/1/2 to be used as aliases for none/strict/normal (for conciseness).

Move presets, tunings, and profiles into libx264
Now any application calling libx264 can use them.
Full documentation and guidelines for usage are included in x264.h.

Faster, more accurate psy-RD caching
Keep more variants of cached Hadamard scores and only calculate them when necessary.
Results in more calculation, but simpler lookups.
Slightly more accurate due to internal rounding in SATD and SA8D functions.

Much faster and more efficient MVD handling
Store MV deltas as clipped absolute values.
This means CABAC no longer has to calculate absolute values in MV context selection.
This also lets us cut the memory spent on MVDs by a factor of 2, speeding up cache_mvd and reducing memory usage by 32*threads*(num macroblocks) bytes.
On a Core i7 encoding 1080p, this is about 3 megabytes saved.
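
The 3 megabyte figure can be checked with a short calculation. The thread count of 12 is an assumption (8 logical cores with 1.5 threads per core); the commit message does not state it.

```python
def mvd_bytes_saved(width, height, threads):
    """32 * threads * (number of macroblocks), per the formula above."""
    mb_width = (width + 15) // 16
    mb_height = (height + 15) // 16
    return 32 * threads * mb_width * mb_height


saved_1080p = mvd_bytes_saved(1920, 1080, threads=12)  # 120 x 68 macroblocks
```

With these assumptions the result is 3,133,440 bytes, in line with the stated "about 3 megabytes".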

Add temporal predictor support to interlaced encoding
0.5-1% better compression in interlaced mode

Keep track of macroblock partitions
Allows vastly simpler motion compensation and direct MV calculation.

Much faster and simpler direct spatial calculation

SimpleBlock requires Matroska Doctype v2

Add GPAC version check

Fix stupid regression in interlaced in r1430
With ref > 8 or b-pyramid, an array over-read could cause slightly incorrect B-frames.

Fix overread of scratch buffer
Could cause crashes on non-mod16 frames.

Fix integer overflow in chroma SSD check
Could cause bad skips at very high quantizers on extreme inputs.

Fix I and B-frame QPs with threads
Rounding errors resulted in slightly wrong QPs with threads enabled.

Fix compilation on ARM

Remove unnecessary PIC support macros
yasm has a directive to enable PIC globally

Don't even try direct temporal when it would give junk MVs
In PbBbP pyramid structure, the last "b" cannot use temporal because L0Ref0(L1Ref0) != L0Ref0.
Don't even bother analyzing it, just use spatial.
Should improve speed and direct auto effectiveness in CRF and 1-pass modes when b-pyramid is used.
Also makes --direct temporal useful with --b-pyramid, since it will fall back to spatial for frames where temporal is broken.

iPhone compilation support
Also add --sysroot to configure options

To build for iPhone 3gs / iPod touch 3g:
CC=/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc ./configure --host=arm-apple-darwin --sysroot=/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS3.0.sdk

For older devices, add
--extra-cflags='-arch armv6 -mcpu=arm1176jzf-s' --extra-ldflags='-arch armv6' --disable-asm

ARM NEON versions of weightp functions

Use #ifdef instead of #if in checkasm

Make the ABR buffer consider the distance to the end of the video
Should improve bitrate accuracy in 2-pass mode.
May also slightly improve quality by allowing more variation earlier-on in a file.

Also fix abr_buffer with 1-pass: it does something very different than what it does for 2-pass.
Thus, the earlier change that increased it based on threads caused 1-pass ABR to be somewhat less accurate.

Mark cli_input/output_t variables as const when possible

mkv: Write the x264 version into the file header

This only updates the "writing application"; matroska_ebml.c is the
"muxing application", but the version string for that is still hardcoded.

mkv: Write SimpleBlock instead of Block for frame headers

mkvtoolnix writes these by default since 2009/04/13.
Slightly simplifies muxer and allows 'mkvinfo -s' to show B-frames
as 'B' (but not B-ref frames).

Allow | as a separator between psy-rd and psy-trellis values.
[,:/] are all taken when setting psy-trellis in a zone in an mencoder option.

Also fix a comment typo and remove a useless line of code.

Backport various speed tweak ideas from ffmpeg
Add mv0 early termination to spatial direct calculation
Up to twice as fast direct mv calculation on near-motionless video.

Branchless CAVLC level code adjustment based on trailing ones.
A few clocks faster.

Check tc value before clipping in C version of deblock functions.
Much faster, but nobody uses those anyways.

Thanks to Michael Niedermayer for the ideas.

Implement direct temporal + interlaced
This was much easier than I expected.
It will also be basically useless until TFF/BFF support gets in, since it requires delta_poc_bottom to be set correctly to work well.

Allow longer keyints with intra refresh
If a long keyint is specified (longer than macroblock width-1), the refresh will simply not occur all the time.
In other words, a refresh will take place, and then x264 will wait until keyint is over to start another refresh.
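
A simplified model of that schedule, assuming a refresh starts at the beginning of each keyint period and lasts (macroblock width - 1) frames. The exact frame on which x264 starts a refresh is an assumption of this sketch, as is the function name.

```python
def is_refreshing(frame, keyint, mb_width):
    """True if this frame is part of an intra refresh sweep in the simplified model."""
    refresh_len = min(mb_width - 1, keyint)
    return (frame % keyint) < refresh_len


# 1280 pixels wide = 80 macroblocks, keyint 250:
# 79 refresh frames, then 171 frames of waiting, then the next sweep.
active = sum(is_refreshing(f, 250, 80) for f in range(250))
```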

Overhaul sliced-threads VBV
Make predictors thread-local and allow each thread to poll the others to get their predicted sizes.
Many, many other tweaks to improve quality with small VBV and sliced threads.
Note this may somewhat increase the risk of a VBV underflow in such extreme situations (single-frame VBV).
This is tolerable, as most relevant use-cases are better off with a few rare underflows (even if they have to drop a slice) than consistent low quality.

Print psy-(rd|trellis) with more precision in userdata SEI

More formatting fixes in x264 help

Faster 2x2 chroma DC dequant

Write PASP atom in mp4 muxing
Adds container-level aspect ratio support for mp4.

Fix 2-pass ratecontrol continuation in case of missing statsfile
Didn't work properly if MB-tree was enabled.

Smarter QPRD
Catch some cases in which RD checks can be avoided; reduces QPRD RD calls by 10-20%.

Fix subpel iteration counts with B-frame analysis and subme 6/8
Since subme 6 means "like subme 5, except RD on P-frames", B-frame analysis
shouldn't use the RD subpel counts at subme 6. Similarly with subme 8.
Slightly faster (and very marginally worse) compression at subme 6 and 8.

Simplify decimate checks in macroblock_encode
Also fix a misleading comment.

Improve bidir search, fix some artifacts in fades
Modify analysis to allow bidir to use different motion vectors than L0/L1.
Always try the <0,0,0,0> motion vector for bidir.
Eliminates almost all errant motion vectors in fades.
Slightly improves PSNR as well (~0.015 dB).

Slightly faster predictor_difference_mmxext

Add ability to adjust ratecontrol parameters on the fly
encoder_reconfig and x264_picture_t->param can now be used to change ratecontrol parameters.
This is extraordinarily useful in certain streaming situations where the encoder needs to adapt the bitrate to network circumstances.

What can be changed:
1) CRF can be adjusted if in CRF mode.
2) VBV maxrate and bufsize can be adjusted if in VBV mode.
3) Bitrate can be adjusted if in CBR mode.
However, x264 cannot switch between modes and cannot change bitrate in ABR mode.
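
The rules above can be summarized as a small lookup. The mode strings and field names below are hypothetical labels chosen for this sketch and do not correspond to x264_param_t members.

```python
def reconfigurable_fields(rc_mode, vbv_enabled):
    """Return the set of ratecontrol settings that may change on the fly.

    rc_mode is one of "crf", "abr", "cbr" (hypothetical labels).
    """
    fields = set()
    if rc_mode == "crf":
        fields.add("crf")
    if vbv_enabled:
        fields |= {"vbv_maxrate", "vbv_bufsize"}
    if rc_mode == "cbr":
        fields.add("bitrate")
    return fields


crf_with_vbv = reconfigurable_fields("crf", vbv_enabled=True)
plain_abr = reconfigurable_fields("abr", vbv_enabled=False)
```

Plain ABR without VBV yields an empty set, matching the restriction that bitrate cannot be changed in ABR mode.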

Also fix a bug where x264_picture_t->param reconfig method would not always be frame-exact.

Commit sponsored by SayMama video calling.

Fix regression in r1406
Bitrate was printed incorrectly for some input framerates.

Fix log2f detection, include order, some gcc warnings
r1413 caused crashes on any system with malloc.h.
Also switch to std=c99 or std=gnu99 if supported by the compiler.
Fix visualize support.

Fix abstraction violations in x264.c
No calling application--not even x264cli--should ever look inside x264_t.

Move -D CFLAGS to config.h

Fix stat with large file support

Implement ffms2 version check
Depends on ffms2 version 2.13.1 (r272).
Tries pkg-config's built-in version checking first.
Uses only the preprocessor to avoid cross-compilation issues.

Fix implicit CBR message to only print when in ABR mode
Also make it print outside of debug mode.

Add configure check for log2 support
Some incredibly braindamaged operating systems, such as FreeBSD, blatantly ignore the C specification and omit certain functions that are required by ISO C.
log2f is one of these functions that periodically goes missing in such operating systems.

Add config.log support
Now, if configure fails, you'll be able to see why.

Fix cross-compiling with lavf, add support for ffms2.pc
Also update configure script to work with newest ffms.

Improve DTS generation, move DTS compression into libx264
This change fixes some cases in which PTS could be less than DTS.
Additionally, a new parameter, b_dts_compress, enables DTS compression.
DTS compression eliminates negative DTS (i.e. initial delay) due to B-frames.
The algorithm changes timebase in order to avoid duplicating DTS.
Currently, in x264cli, only the FLV muxer uses it. The MP4 muxer doesn't need it, as it uses an EditBox instead.

Various threading-related cosmetics
Simplify a lot of code and remove some unnecessary variables.

Hardcode the bs_t in cavlc.c; passing it around is a waste
Saves ~1.5kb of code size, very slight speed boost.

Fix lavf input with pipes and image sequences
x264 should now be able to encode from an image sequence using an image2-style formatted string (e.g. file%02d.jpg).
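
For reference, an image2-style string is a printf-style pattern that expands to one filename per frame. A minimal sketch of the expansion (the helper name and the starting index of 1 are assumptions for illustration only):

```python
def sequence_names(pattern, first, count):
    """Expand a printf-style pattern such as file%02d.jpg into filenames."""
    return [pattern % i for i in range(first, first + count)]

names = sequence_names("file%02d.jpg", 1, 3)
print(names)
```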

Fix bitstream alignment with multiple slices
Broke multi-slice encoding on CPUs without unaligned access.
New system simply forces a bitstream realignment at the start of each writing function and flushes when it reaches the end.

Merge nnz_backup with scratch buffer
Slightly less memory usage.

Use cross-prefix properly with pkg-config for cross-compiling

Various performance optimizations
Simplify and compact storage of direct motion vectors, faster --direct auto.
Shrink various arrays to save a bit of cache.
Simplify and reorganize B macroblock type writing in CABAC.
Add some missing ALIGNED macros.

Fix crash on new AMD M300 and similar CPUs
Apparently these CPUs have SSE4a, but not misaligned SSE.

Fix intra refresh with subme < 6
Also improve the quality of intra masking.

Add support for multiple --tune options
Tunes apply in the order they are listed in the case of conflicts.
Psy tunings, i.e. film/animation/grain/psnr/ssim, cannot be combined.
Also clarify --profile, which forces the limits of a profile, not the profile itself.
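
The tune-combination rule can be sketched as a simple check. This is a toy validator, not x264's option parser; the function name is hypothetical, and zerolatency is used as an example of a non-psy tune.

```python
# Psy tunings named in the commit message; at most one may be used.
PSY_TUNES = {"film", "animation", "grain", "psnr", "ssim"}

def check_tunes(tunes):
    """Return tunes in application order, rejecting more than one psy tune."""
    psy = [t for t in tunes if t in PSY_TUNES]
    if len(psy) > 1:
        raise ValueError("psy tunings cannot be combined: " + ", ".join(psy))
    return list(tunes)

print(check_tunes(["film", "zerolatency"]))
```

Because tunes apply in listed order, the returned list preserves the caller's order; a later tune would win any conflict.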

Various bugfixes and tweaks in analysis
Fix the oldest-ever bug in x264: b16x8 analysis used the wrong width for predict_mv.
Fix cache_ref calls for slightly better MV prediction in bsub16x16 analysis.
Make B-partition analysis consider reference frame costs.
Various other minor changes.
Overall very slightly improved mode decision and motion search in B-frames.

More --me tesa optimizations

Fix typo in configure

Make --fps force CFR mode

Eliminate intentional array overflow in quant matrix handling
While it probably never caused problems, it was incredibly ugly and evil.

Faster --me tesa

Fix static pthreads + dynamically linked x264 on win32
Add the necessary static pthread initialization code to a new DLLmain function.

Add getopt_long to the included getopt.c
Fixes option handling on OSs that have a nonworking/missing getopt (e.g. Solaris).

Faster psy-trellis init
Remove some unnecessary zigzags.

Simplify intra mode availability handling
Slightly faster, 1.5kb smaller binary size, less code.

Fix free callback, add x264_encoder_parameters function
x264 would try to use the passed param struct after freeing if the param_free callback was set.
Probably didn't cause any issues, as probably no programs used the callback in this location yet.

A new x264_encoder_parameters function is now available in the API.
This function lets the calling application grab the current state of the encoder's parameters.
Use this in x264cli to ensure that the param struct used for set_param is updated with whatever changes x264_encoder_open has made to it.

Patch partially by Anton Mitrofanov.

Fix x264 compilation on Apple GCC
Apple's GCC stupidly ignores the ARM ABI and doesn't give any stack alignment beyond 4.

Faster weightp motion search
For blind-weight dupes, copy the motion vector from the main search and qpel-refine instead of doing a full search.
Fix the p8x8 early termination, which had unexpected results when combined with blind weighting.
Overall, marginally reduces compression but should potentially improve speed by over 5%.

More correct padding constants for lowres planes
Since lowres analysis isn't interlace-aware, we don't need to double the vertical padding for interlaced video.

Fix some invalid reads caught by valgrind
Temporal predictor calculation was misled by invalid reference counts for I-frames.

Periodic intra refresh
Uses SEI recovery points, a moving vertical "bar" of intra blocks, and motion vector restrictions to eliminate keyframes.
Attempt to hide the visual appearance of the intra bar when --no-psy isn't set.
Enabled with --intra-refresh.
The refresh interval is controlled using keyint, but won't exceed the number of macroblock columns in the frame.
Greatly benefits low-latency streaming by making it possible to achieve constant framesize without intra-only encoding.
Combined with --slice-max-size for one slice per packet, tests suggest effective resilience against packet loss as high as 25%.
x264 is now the best free software low-latency video encoder in the world.
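
A rough picture of the moving intra bar, assuming 16-pixel macroblock columns and a bar that sweeps left to right once per refresh interval. The function names and the exact column arithmetic are illustrative assumptions, not taken from the x264 source.

```python
def mb_columns(width):
    """Number of 16-pixel macroblock columns needed to cover the frame width."""
    return (width + 15) // 16

def refresh_interval(keyint, width):
    """Refresh period in frames: keyint, capped at the macroblock column count."""
    return min(keyint, mb_columns(width))

def intra_bar(frame_index, keyint, width):
    """Half-open range of macroblock columns coded as intra in this frame."""
    cols = mb_columns(width)
    interval = refresh_interval(keyint, width)
    step = frame_index % interval
    return (step * cols // interval, (step + 1) * cols // interval)

# 1280-wide video has 80 columns, so keyint 250 is capped to an 80-frame sweep.
print(refresh_interval(250, 1280), intra_bar(0, 250, 1280))
```

With a shorter interval than the column count, the bar in this sketch simply becomes wider so that every column is still refreshed once per sweep.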

Accordingly, change the API to add b_keyframe to the parameters present in output pictures.
Calling applications should check this to see if a frame is seekable, not the frame type.

Also make x264's motion estimation strictly abide by horizontal MV range limits in order for PIR to work.
Also fix a major bug in sliced-threads VBV handling.
Also change "auto" threads for sliced threads to "cores" instead of "1.5*cores" after performance testing.
Also simplify ratecontrol's checking of first pass options.
Also some minor tweaks to row-based VBV that should improve VBV accuracy on small frames.

LAVF/FFMS input support, native VFR timestamp handling
libx264 now takes three new API parameters.
b_vfr_input tells x264 whether or not the input is VFR, and is 1 by default.
i_timebase_num and i_timebase_den pass the timebase to x264.

x264_picture_t now returns the DTS of each frame: the calling app need not calculate it anymore.
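
As an illustration of what the timebase pair means: a timestamp in ticks times num/den gives seconds. The sketch below uses exact fractions; the helper name and the 1001/30000 example timebase are assumptions, not values taken from x264.

```python
from fractions import Fraction

def ticks_to_seconds(ticks, timebase_num, timebase_den):
    """Convert a timestamp in timebase ticks to seconds, exactly."""
    return ticks * Fraction(timebase_num, timebase_den)

# Third frame of NTSC-rate video with one tick per frame.
print(ticks_to_seconds(3, 1001, 30000))
```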

Add libavformat and FFMS2 input support: requires libav* and ffms2 libraries respectively.
FFMS2 is _STRONGLY_ preferred over libavformat: we encourage all distributions to compile with FFMS2 support if at all possible.
FFMS2 can be found at
--index, a new x264cli option, allows the user to store (or load) an FFMS2 index file for future use, to avoid re-indexing in the future.

Overhaul the muxers to pass through timestamps instead of assuming CFR.
Also overhaul muxers to correctly use b_annexb and b_repeat_headers to simplify the code.
Remove VFW input support, since it's now pretty much redundant with native AVS support and LAVF support.
Finally, overhaul a large part of the x264cli internals.

--force-cfr, a new x264cli option, allows the user to force the old method of timestamp handling. May be useful in case of a source with broken timestamps.
Avisynth, YUV, and Y4M input are all still CFR. LAVF or FFMS2 must be used for VFR support.

Do note that this patch does *not* add VFR ratecontrol yet.
Support for telecined input is also somewhat dubious at the moment.

Large parts of this patch by Mike Gurlitz, Steven Walters, and Yusuke Nakamura.

More help typo fixes

Fix x264_clz on inputs > 1<<31
(though x264 never generates such inputs)

Don't do sum/ssd analysis if weightp == 1
Typo fixes in comments and help.

Fix two bugs in 2-pass ratecontrol
last_qscale_for wasn't set during the 2pass init code.
abr_buffer was way too small in the case of multiple threads, so accordingly increase its buffer size based on the number of threads.
May significantly increase quality with many threads in 2-pass mode, especially in cases with extremely large I-frames, such as anime.

Avisynth-MT and 2.6 compatibility fixes
Explain to the user why YV12 conversion is forced with Avisynth 2.6.
Fix encoding with Avisynth-MT scripts by inserting the necessary Distributor() call; speeds such scripts back up to expected levels.

Fix zone parsing on mingw
Due to MinGW evidently being in the hands of a pack of phenomenal idiots, MinGW does not have strtok_r, a basic string function.
As such, remove the dependency on strtok_r in zone parsing.
Previously, using zones for anything other than ratecontrol failed.
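
For illustration, zone strings can be split with plain delimiter handling and no stateful tokenizer. The sketch assumes the CLI syntax start,end,option[,option...] with zones separated by "/"; the parser itself is hypothetical and is not the x264 implementation.

```python
def parse_zones(spec):
    """Parse 'start,end,key=value[,key=value...]/...' into a list of zones."""
    zones = []
    for zone in spec.split("/"):
        start, end, *options = zone.split(",")
        settings = dict(opt.split("=", 1) for opt in options)
        zones.append((int(start), int(end), settings))
    return zones

print(parse_zones("0,100,q=20/200,300,b=0.5"))
```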

More lookahead optimizations
Under subme 1, don't do any qpel search at all and round temporal MVs accordingly.
Drop internal subme with subme 1 to do fullpel predictor checks only.
Other minor optimizations.

missing changes from previous commits

Fix regression in direct=auto/temporal in r1364
Bug caused rare race condition in frame reference handling.
This resulted in invalid bitstreams in some B-frames and, very rarely, crashes.

Add fast pskip to x264 SEI info header

Minor seeking fix with Avisynth input
Seeking past the end of the input with --seek would result in the same frame being repeated over and over.

Add support for MB-tree + B-pyramid
Modify B-adapt 2 to consider pyramid in its calculations.
Generally results in many more B-frames being used when pyramid is on.
Modify MB-tree statsfile reading to handle the reordering necessary.
Make differing keyint or pyramid between passes into a fatal error.

Use aliasing-avoidance macros in array_non_zero

MMX version of 8x8 interlaced zigzag
Just as fast as SSSE3 on Nehalem (and faster on Conroe/Penryn), so remove the SSSE3 version.

Bring back slice-based threading support
Enabled with --sliced-threads
Unlike normal threading, adds no encoding latency.
Less efficient than normal threading, both performance and compression-wise.
Useful for low-latency encoding environments where performance is still important, such as HD videoconferencing.
Add --tune zerolatency, which eliminates all x264 encoder-side latency (no delayed frames at all).
Some tweaks to VBV ratecontrol and lookahead (in addition to those required by sliced threading).
Commit sponsored by a media streaming company that wishes to remain anonymous.
Date: Mon Dec 7 18:17:29 2009 -0800

Add more detailed help for presets/tunes/profiles
Shows what options they represent.

qpel RD no longer needs mbcmp_unaligned

ensure that all boolean options are {0,1} so they print consistently in the options SEI

Actually do r1356
Somehow commit r1356 got lost in the ether. I'm not sure how, but now it's fixed.

Remove some unused code from x264.c

SSSE3 version of zigzag_8x8_field
Slightly faster interlaced encoding with 8x8dct.
Helps most on Nehalem, somewhat disappointing on Conroe/Penryn.

Fix crash in interlaced with >8 refs
Crash introduced in weightp.

Significantly faster qpel-RD
Cache the results of MC, like in bidir-RD.
Slightly changes output due to the necessary reordering of satd/RD calls.
5-10% faster qpel-RD.

Add x264 prefix to functions with ffmpeg equivalents
Not important now, but will be when we add libav* input support.

10L in r1353
Broke mp4 output.

Enhanced Avisynth input support
Requires avisynth_c.h from the Avisynth API headers.
Reports errors properly from Avisynth script input.
Automatically construct input scripts for almost any input file.
Tries ffmpegsource2, DSS2, directshowsource, and many other sourcing methods, based on the input file extension.
Automatically converts to YV12.

Much faster weightp
Move sum/ssd calculation out of lookahead and do it only once per frame.
Also various minor optimizations, cosmetics, and cleanups.

Fix bugs in fps/timestamp handling in FLV muxer

Fix bug in weightp analysis
Weights weren't reset upon early terminations, so old (wrong) weights could stick around.
Small compression improvement.

Minor deblocking optimization, update comments

Fix weightb with delta_poc_bottom
Has no effect yet, but will be required once we add TFF/BFF signalling support in interlaced mode.
Gives 0.5-0.7% better compression with proper TFF/BFF signalling.

Give more meaningful error if 1st/2nd pass resolution differ

Fix extremely rare deadlock with sync-lookahead
Patch partially by Anton Mitrofanov.

Only print weightp stats if there were P-frames

Faster lookahead with subme=1
If it hasn't been clear already, don't use subme=1 as a "fast first pass" option.
Use subme=2 instead; 1 and below now enable a fast (lower quality) lookahead mode.

Faster weightp analysis
Modify pixel_var slightly to return the necessary information and use it for weight analysis instead of sad/ssd.
Various minor cosmetics.

Fix two issues in weightp
If analysis decided on an offset of -128, x264 would create non-compliant streams.
Fix some cases with nearly all intra blocks where analysis could pick very weird weights.
Also add some asserts to check compliancy.

Allow compilation with non-Apple GCC on OS X

Use __attribute__((may_alias)) for type-punning
GCC thinks pointer casts to unions aren't valid with strict aliasing.
Also use M32() in y4m.c.
Enable -Wstrict-aliasing again since all such warnings are fixed.

100l in deadlock fix

FLV muxing support

Fix rare deadlock introduced in weightp

Actually add -Wno-strict-aliasing to configure

Various weightp fixes
Make weightp results match in threaded vs non-threaded mode.
Fix two-pass with slow-firstpass.

Fix all aliasing violations
New type-punning macros perform write/read-combining without aliasing violations per the second-to-last part of 6.5.7 in the C99 specification.
GCC 4.4, however, doesn't seem to have read this part of the spec and still warns about the violations.
Regardless, it seems to fix all known aliasing miscompilations, so perhaps the GCC warning generator is just broken.
As such, add -Wno-strict-aliasing to CFLAGS.

Fix 10l in weightp on ARM

Fix one (of possibly many) miscompilations in weightp
Use NOINLINE and some emms calls to fix emms reordering issues.
This issue occurred with some GCC versions if threads > 1 and the phase of the moon was right.
Also a cosmetic in x264.c.

Fix pixel_ssd on win64
Didn't preserve XMM registers, may or may not have caused problems.

Fix weightp logfile parsing on MinGW


Fix weightp on ARM + PPC
No ARM or PPC assembly yet though.

Weighted P-frame prediction
Merge Dylan's Google Summer of Code 2009 tree.
Detect fades and use weighted prediction to improve compression and quality.
"Blind" mode provides a small overall quality increase by using a -1 offset without doing any analysis, as described in JVT-AB033.
"Smart", the default mode, also performs fade detection and decides weights accordingly.
MB-tree takes into account the effects of "smart" analysis in lookahead, even further improving quality in fades.
If psy is on, mbtree is on, interlaced is off, and weightp is off, fade detection will still be performed.
However, it will be used to adjust quality instead of create actual weights.
This will improve quality in fades when encoding in Baseline profile.
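
For readers unfamiliar with weighted prediction, the per-sample operation in H.264 explicit weighting scales the reference sample, rounds, adds an offset, and clips. The sketch below follows that general formula for 8-bit samples; the function name and the example weights are illustrative and say nothing about how x264 chooses them.

```python
def weighted_pred(sample, weight, offset, log2_denom):
    """H.264-style explicit weighted prediction of one 8-bit sample."""
    if log2_denom >= 1:
        rounding = 1 << (log2_denom - 1)
        value = ((sample * weight + rounding) >> log2_denom) + offset
    else:
        value = sample * weight + offset
    return max(0, min(255, value))  # clip to the 8-bit range

# Unit weight (32/32) with a -1 offset, as in the "blind" mode described above.
print(weighted_pred(100, 32, -1, 5))
# Half weight (16/32), roughly what a fade to black might call for.
print(weighted_pred(100, 16, 0, 5))
```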

Doesn't add support for interlaced encoding with weightp yet.
Only adds support for luma weights, not chroma weights.
Internal code for chroma weights is in, but there's no analysis yet.
Baseline profile requires that weightp be off.
All weightp modes may cause minor breakage in non-compliant decoders that take shortcuts in deblocking reference frame checks.
"Smart" may cause serious breakage in non-compliant decoders that take shortcuts in handling of duplicate reference frames.

Thanks to Google for sponsoring our most successful Summer of Code yet!

Fix assert failure in the case of forced i-frames
Note that this applies to non-IDR i-frames, not IDR-frames.
This fix is also required for future open-gop.

Fix issues relating to input/output files being pipes/FIFOs

Various ARM-related fixes
Fix comment for mc_copy_neon.
Fix memzero_aligned_neon prototype.
Update NEON (i)dct_dc prototypes.
Duplicate x86 behavior for global+hidden functions.

Fix miscompilation with gcc 4.3 on ARM
Aliasing violation in spatial prediction caused nasty artifacts.
Shut up two other GCC warnings while we're at it.

Fix extremely rare infinite loop in 2-pass VBV
Implicit conversion from double->float lost enough precision to cause the loop termination condition to never trigger.
Bug report by Tal Aloni.
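
The class of bug is easy to reproduce in isolation: an increment that survives in double precision can vanish once the value is rounded to single precision, so a loop waiting for the value to change never exits. A minimal demonstration (not the actual ratecontrol loop; the step size is arbitrary):

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

step = 1e-9
as_double = 1.0 + step             # the increment is representable in a double
as_float = to_float32(1.0 + step)  # single precision rounds it back to 1.0

print(as_double > 1.0, as_float > 1.0)
```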

Fix large file support, broken in r1302

Dramatically reduce size of pixel_ssd_* asm functions
~10k of code size eliminated.

fix bottom-right pixel of lowres planes, which was uninitialized.
weirdly, valgrind reported this only with --no-asm.


ISC-license x86inc.asm
As the assembly abstraction layer is very useful in non-x264 projects, it is now ISC (simplified BSD) so that others, even in commercial projects, can use it as well.

Various minor CABAC optimizations

Fix bug in b-pyramid strict
Bug caused invalid streams in some situations.

Remove non-mod16 warning
Compression only "suffers" by an extremely marginal amount and too many people misinterpret the warning.

Fix two warnings + some minor optimizations

Fix a typo in b-pyramid help
And an errant space in common/macroblock.c

A bit more write-combining in macroblock_cache_load

split muxers.c into one file per format
simplify internal muxer API

Update fprofile with the latest change to b-pyramid

Fix assertion fail and incorrect costs with pyramid+VBV
Deal properly with QPfile'd B-refs. x264 should handle multiple B-refs per minigop now, though only via forced frametypes.

Improve CRF initial QP selection, fix get_qscale bug
If qcomp=1 (as in mb-tree), we don't need ABR_INIT_QP.
get_qscale could give slightly weird results with still images

Print more accurate error message if dump_yuv fails

Reduce memory usage of b-adapt 2 trellis
Also fix a minor bug where the algorithm ignored the last frame in the trellis.

Make B-pyramid spec-compliant
The rules of the specification with regard to picture buffering for pyramid coding are widely ignored.
x264's b-pyramid implementation, despite being practically identical to that proposed by the original paper, was technically not compliant.
Now it is.
Two modes are now available:
1) strict b-pyramid, while worse for compression, follows the rule mandated by Blu-ray (no P-frames can reference B-frames)
2) normal b-pyramid, which is like the old mode except fully compliant.
This patch also adds MMCO support (necessary for compliant pyramid in some cases).
MB-tree still doesn't support b-pyramid (but will soon).

Add missing free for nal_buffer
Fixes a memory leak.

sync yasm macros to ffmpeg

eliminate some divisions

Fix glitches with slow-firstpass + weightb + multiref + 2pass
Bug in r1277

Simplify some code in b-adapt 2's trellis

Fix a very rare integer overflow in slicetype analysis
Caused an assert failure when it occurred.
Bug is as old as adaptive B-frames.

Reduce the aggressiveness of 2-pass VBV
Now that B-frames are properly covered, we don't have to be as aggressive.
This eliminates some issues with skyrocketing QPs in B-frames in 2-pass VBV.

Fix regression: disable flash detection without B-frames

change all dct arrays to 1d.
the C standard doesn't allow you to iterate 1-dimensionally over 2d arrays, and nothing other than the dsp functions themselves cares about the 2dness of dct.
this fixes a miscompilation in x264_mb_optimize_chroma_dc.

Add row-based VBV for B-frames
While B-frames still aren't explicitly covered by ratecontrol, this should resolve issues of VBV underflows due to larger-than-expected B-frames.

Improve VBV, fix bug in 2-pass VBV introduced in MB-tree
Bug caused AQ'd row/frame costs to not be calculated (and thus caused underflows).
Also make VBV more aggressive with more threads in 2-pass mode.
Finally, --ratetol now affects VBV aggressiveness (higher is less aggressive).

Optimize exp2fix8
Slightly faster and more accurate rounding.

Avoid scenecuts in flashes and similar situations
"Flashes" are defined as any scene which lasts a very short period before a previous scene returns.
A common example of this is of course a camera flash.
Accordingly, look ahead during scenecut analysis and rule out the possibility of certain frames being scenecuts.
Also handles cases of tons of short scenes in sequence and avoids making those scenecuts as well.
Can only catch flashes of 1 frame in length with b-adapt 1.
With b-adapt 2, can catch flashes of length --bframes.
Speed cost should be negligible.
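
The lookahead idea can be shown with a toy model in which every frame already carries a scene label. A real encoder has no such labels and works from frame costs, so this is only a sketch of the rule "ignore a change if the previous scene returns within N frames"; all names are hypothetical.

```python
def real_scenecuts(scene_ids, max_flash_len):
    """Return frame indices of scene changes that are not short flashes."""
    cuts = []
    i = 1
    while i < len(scene_ids):
        if scene_ids[i] != scene_ids[i - 1]:
            prev = scene_ids[i - 1]
            # Look ahead: does the previous scene come back soon enough?
            window = scene_ids[i + 1 : i + 1 + max_flash_len]
            if prev in window:
                # Skip the flash and the frame on which the old scene returns.
                i += window.index(prev) + 2
                continue
            cuts.append(i)
        i += 1
    return cuts

# One-frame flash at index 3, genuine cut at index 6.
print(real_scenecuts([0, 0, 0, 1, 0, 0, 2, 2, 2], 1))
```

Raising max_flash_len plays the role of the longer lookahead available with b-adapt 2 in this model.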

Fix bug where x264 generated non-compliant bitstreams with insane SAR values

rm msvc project files and related ifdefs

SSE4 version of 4x4 idct
27->24 clocks on Nehalem.
This is really just an excuse to use "movsd" in a real function.
Add some comments to subsum-related macros in x86util.

Constrained intra prediction support
Enable with --constrained-intra. Significantly reduces compression, but required for the base layer of SVC encodes and maybe some other use-cases.

Commit sponsored by a media streaming company that wishes to remain anonymous.

Slightly improve non-RD p8x8 mode decision
Subpartition costs are effectively zero in CABAC if sub-8x8 search is off.

Reorder reference frames optimally on second pass
About +0.1-0.2% compression at normal bitrates, up to +1% at very low bitrates.
Only works if the first pass uses the same number of refs as the second (i.e. not with fast first pass).
Thus, only worthwhile at insanely slow speeds: as such, enable slow-firstpass by default with preset placebo.
Note that this changes the stats file format!

Fix typo in ratecontrol_summary

Clip log2_max_frame_num
It's still much higher than it needs to be, but that will be fixed with the upcoming MMCO patch.
Also make sure we don't write too large a frame_num or poc in slice header.

Fix some issues with 3-pass statsfile handling
The value of i_frame during encoder_close was incorrect.

Fix ctrl-C termination message with few frames encoded

Add support for single-frame VBV, improve compliance
This allows both constant-framesize and capped-framesize encoding.
Literal constant framesize isn't actually supported yet due to the lack of
filler support.
Example with 30fps video: --vbv-bufsize 200 --vbv-maxrate 6000 will ensure that
no frame is ever larger than 200 kilobits.

One example use-case of this is for zero-delay streaming where bandwidth costs
need to be minimized. If every frame is smaller than 200 kilobits and the
client has a 6 megabit connection, every single frame can be instantly sent
to the client and handled without any decoder-side buffer.
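
The arithmetic behind the example: at 30 fps, a 6000 kbit/s maxrate delivers 200 kilobits per frame interval, so a 200 kilobit buffer holds exactly one frame's worth of data. A quick check (helper names are illustrative):

```python
def per_frame_budget_kbit(maxrate_kbps, fps):
    """Kilobits the channel can deliver during one frame interval."""
    return maxrate_kbps / fps

def send_time_seconds(frame_kbit, link_kbps):
    """Time needed to push one frame of the given size down the link."""
    return frame_kbit / link_kbps

budget = per_frame_budget_kbit(6000, 30)
print(budget, send_time_seconds(200, 6000))
```

A 200 kilobit frame on a 6 megabit link takes one frame interval (1/30 s) to send, which is why no decoder-side buffering is needed.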

Fix a mistake in VBV calculation--this may have caused the VBV to be slightly
non-compliant in some situations without x264 realizing it.
Add primitive prediction handling for rows with quantizers lower than their
reference. This slightly improves VBV in CBR mode.
Various other minor improvements to VBV, mostly to make single-frame VBV work.

Commit sponsored by a media streaming company that wishes to remain anonymous.

Fix 10l in API change
frame_num was set to 1, not 0, for the first frame. This broke spec compliance.
Didn't actually seem to cause any problems though except for breaking decoding on Quicktime.

Allow user-set FPS for inputs other than YUV

Improve threaded frame handling
Avoid unnecessary cond_wait

Attempt to detect miscompilation due to bug in gcc 4.2
I don't know if this bug still affects latest x264, but it can't hurt to try to detect it.
Accordingly refuse to open the encoder if detected.
Apparently VLC (on Windows) has been distributed for some time with a completely broken x264 due to the use of a completely broken compiler (gcc 4.2). In particular, the MV costs seem to be calculated incorrectly on win32 when linking from an application compiled without -ffast-math to an application with -ffast-math.
I am not entirely certain why this occurs, but the result is, unsurprisingly, encoding quality that makes MPEG-2 look good, due to the motion search being completely broken.

Really fix encoder_close crash this time
Not-entirely-fixed in r1253.

Check for 16x16 partitions masquerading as smaller ones
Saves a few bits when using qpel-RD.

Update config.guess/sub; add Snow Leopard support

Fix integer overflow in 2-pass VBV
Bug caused slight undersizing in 2-pass mode in some cases.

Fix bug with various bizarre commandline combinations and mbtree
Second pass would have mbtree on even though the first pass didn't (and thus encoding would immediately fail).

Add intra prediction modes to output stats
Also eliminate some NANs in stat output with intra-only encoding.
Marginal speedup: disable stat calculation if log level is below X264_LOG_INFO.
Various minor cosmetics.

Overhaul syntax in muxers.c/matroska.c
The inconsistent syntax in these files has finally come to an end.

Major API change: encapsulate NALs within libx264
libx264 now returns NAL units instead of raw data. x264_nal_encode is no longer a public function.
See x264.h for full documentation of changes.
New parameter: b_annexb, on by default. If disabled, startcodes are replaced by sizes as in mp4.
x264's VBV now works on a NAL level, taking into account escape codes.
VBV will also take into account the bit cost of SPS/PPS, but only if b_repeat_headers is set.
Add an overhead tracking system to VBV to better predict the constant overhead of frames (headers, NALU overhead, etc).
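
The two framings selected by b_annexb differ only in what precedes each NAL payload. A minimal sketch, assuming 4-byte start codes and 4-byte big-endian sizes (the helper names are hypothetical, and escape-code handling is left out):

```python
import struct

START_CODE = b"\x00\x00\x00\x01"

def frame_annexb(nal_payloads):
    """Annex B framing: each NAL is preceded by a start code."""
    return b"".join(START_CODE + p for p in nal_payloads)

def frame_length_prefixed(nal_payloads):
    """mp4-style framing: each NAL is preceded by its size in bytes."""
    return b"".join(struct.pack(">I", len(p)) + p for p in nal_payloads)

nals = [b"\x67\x42", b"\x68\xce\x3c"]
print(frame_annexb(nals).hex())
print(frame_length_prefixed(nals).hex())
```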

Add missing fclose for mbtree input statsfile on second pass
Bug report by VFRmaniac

Improve progress indicator behavior
Progress indicator will now indicate based on output frame, not input frame.

Update yasm configure check
lzcnt apparently requires yasm 0.6.2.

Make MV costs global instead of static
Fixes some extremely rare threading race conditions and makes the code cleaner.
Downside: slightly higher memory usage when calling multiple encoders from the same application.

Don't print scenecut message multiple times in verbose mode
Occurred mostly with b-adapt 2.

Optimize rounding of luma and chroma DC coefficients
Reduce bitrate mostly-losslessly at low quantizers.
In some rare cases, bitrate reduction may be as high as 10%.
Luma rounding optimization (helps much less than chroma) requires trellis.

Fix crash if encoder_close is called before delayed frames are flushed
Also no longer flush frames when ctrl-Cing x264, so x264 will close faster.

Improve x264 help
Now has three help options: --help, --longhelp, and --fullhelp.
--help only shows the most basic options; most users should not need more than these.
Add usage examples.
Fix typo in a comment.

Factor out a redundant RD call in qpel-RD
Fixes a problem that was supposed to be, but wasn't, fully fixed in r1238.

Fix RD early-skip
Small quality improvement and speedup, was broken by r1214.

Faster CAVLC mb header writing for B macroblocks

Compile fixes for pre-ARMv6T2 and/or PIC

Change priority handling on some OSs
Instead of setting the lookahead thread to max priority, lower all the other threads' priorities instead.
This is particularly useful when the "max priority" is "realtime", as in Windows, which can cause some problems.

Threaded lookahead
Move lookahead into a separate thread, set to higher priority than the other threads, for optimal performance.
Reduces the amount that lookahead bottlenecks encoding, greatly increasing performance with lookahead-intensive settings (e.g. b-adapt 2) on many-core CPUs.
Buffer size can be controlled with --sync-lookahead, which defaults to auto (threads+bframes buffer size).
Note that this buffer is separate from the rc-lookahead value.
Note also that this does not split lookahead itself into multiple threads yet; this may be added in the future.
Additionally, split frames into "fdec" and "fenc" frame types and keep the two separate.
This split greatly reduces memory usage, which helps compensate for the larger lookahead size.
Extremely special thanks to Michael Kazmier and Alex Giladi of Avail Media, the original authors of this patch.

Force a link error in case of incompatible API
This is because the number of bug reports due to miscompiled ffmpeg builds is reaching critical mass.
The name of x264_encoder_open is now #defined based on the current X264_BUILD.
Note that this changes the calling convention required for dlopen, but not for ordinary calls to x264_encoder_open.
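
For dlopen-style callers, this means the symbol to look up now carries the build number. The sketch assumes the exported name is x264_encoder_open_ followed by the value of X264_BUILD; check x264.h for the exact form. The build number 79 below is only a placeholder.

```python
def encoder_open_symbol(build):
    """Name to pass to dlsym for a library built with the given X264_BUILD."""
    return "x264_encoder_open_%d" % build

print(encoder_open_symbol(79))  # 79 is a placeholder build number
```

Code that calls x264_encoder_open directly through the header is unaffected, since the macro does the renaming at compile time.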

Get rid of "CBR" descriptor from qcomp
Though technically accurate in some vague way, I have never actually seen this
option used correctly, rather it has been used by hundreds of people who can't
read the documentation and believe that qcomp=0 is what should be used for CBR

Faster me=tesa
But it still spends all too much time in me_search_ref rather than asm.

Multi-slice encoding support
Slicing support is available through three methods (which can be mixed):
--slices sets a number of slices per frame and ensures rectangular slices (required for Blu-ray). Overridden by either of the following options:
--slice-max-mbs sets a maximum number of macroblocks per slice.
--slice-max-size sets a maximum slice size, in bytes (includes NAL overhead).
Implement macroblock re-encoding support to allow highly accurate slice size limitation. Might be useful for other things in the future, too.
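
As a rough illustration of the macroblock-count limit only (the size-based limit depends on the actual encoded bits and cannot be modelled this simply), a frame's macroblocks can be cut into consecutive runs. The function name is hypothetical.

```python
def split_by_max_mbs(total_mbs, max_mbs):
    """Split macroblocks 0..total_mbs-1 into half-open slices of at most max_mbs."""
    return [(start, min(start + max_mbs, total_mbs))
            for start in range(0, total_mbs, max_mbs)]

print(split_by_max_mbs(10, 4))
```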

Fix a valgrind warning in b-adapt 2

fix asm symbols for oprofile (regression in r1221)

Fix bug in intra analysis in B-frames
i8x8/i4x4 never got analysed when fast_intra was toggled and RD was off; up to a 2-3% quality improvement in non-RD mode.
With this bug dating back to r369, this is probably the second-oldest bug ever fixed in x264.

Fix bug in b16x16 qpel RD
Incorrect cost was used to initialize the search.

Check minimum chroma QP in addition to luma QP during CQM init
Correctly error out if the implied minimum chroma QP is too low.
Add missing emms to checkasm macroblock_tree_propagate test.

Faster mbtree propagate and x264_log2, less memory usage
Avoid an int->float conversion with a small table.
Change lowres_inter_types to a bitfield; cut its size by 75%.
Somewhat lower memory usage with lots of bframes.
Make log2/exp2 tables global to avoid duplication.
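
A 75% saving is what you get when values that fit in 2 bits are packed four to a byte instead of one per byte. The sketch below shows that general technique; the bit layout is an assumption and is not taken from the x264 source.

```python
def pack2(values):
    """Pack a list of 2-bit values (0..3) four to a byte."""
    out = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        out[i // 4] |= (v & 3) << (2 * (i % 4))
    return bytes(out)

def unpack2(packed, count):
    """Recover the first count 2-bit values from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 3 for i in range(count)]

types = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack2(types)
print(len(types), "entries stored in", len(packed), "bytes")
```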

Fix keyint=1 + VBV + rc-lookahead

Faster x264_exp2fix8
22->13 cycles on Core 2 with mfpmath=sse

compile x86 with fpmath=sse by default

ARM configure: enable NEON-related options by default
When compiling for ARM, x264 will compile by default for Cortex A8 unless specified otherwise.
To compile for pre-ARMv6, --disable-asm is required.

2-pass VBV fixes
Properly run slicetype frame cost with 2pass + MB-tree.
Slash the VBV rate tolerance in 2-pass mode; increasing it made sense for the highly reactive 1-pass VBV algorithm, but not for 2-pass.
2-pass's planned frame sizes are guaranteed to be reasonable, since they are based on a real first pass, while 1-pass's, based on lookahead SATD, cannot always be trusted.

GSOC merge part 8: ARM NEON intra prediction assembly functions (partial)
4x4 dc/h/ddr/ddl, 8x8 dc/h, 8x8c h/v, 16x16 dc/h/v

GSOC merge part 7: ARM NEON deblock assembly functions (partial)
Originally written for ffmpeg by Mans Rullgard; ported by David.
Luma and chroma inter deblocking; no intra yet.

GSOC merge part 6: ARM NEON quant assembly functions (partial)
(de)quant 4x4, (de)quant 8x8, (de)quant DC, coeff_last

GSOC merge part 5: ARM NEON dct assembly functions
(i)dct4x4dc, (i)dct4x4, (i)dct8x8, (i)dct_dc, zigzag_scan_frame_4x4

GSOC merge part 4: ARM NEON mc assembly functions
prefetch, memcpy_aligned, memzero_aligned, avg, mc_luma, get_ref, mc_chroma, hpel_filter, frame_init_lowres

GSOC merge part 3: ARM NEON pixel assembly functions

GSOC merge part 2: ARM stack alignment
Neither GCC nor ARMCC support 16 byte stack alignment despite the fact that NEON loads require it.
These macros only work for arrays, but fortunately that covers almost all instances of stack alignment in x264.

Fix unaligned accesses in bitstream writer
Fixes x264 on CPUs with no unaligned access support (e.g. SPARC).
Improves performance marginally on CPUs with penalties for unaligned stores (e.g. some x86).

Fix bug in calculation of I-frame costs with AQ.

GSOC merge part 1: Framework for ARM assembly optimizations
x264 will detect which ARM core it's building for and only build NEON asm if the target is ARMv6 or above, then enable NEON at runtime.

Fix a bug in checkasm and two OSX fixes
MC chroma checkasm test could crash in some situations
Remove -lmx, as it's not needed and the iPhone doesn't have it.
Remove unused sqrtf emulation; it breaks if math.h is included.

Improve QPRD
Always check the last macroblock's QP, even if the normal search doesn't reach it.
Raise the failure threshold when moving towards the last macroblock's QP.
0.2-1% improved compression.

Fix MB-tree with keyint<3
Also slightly improve VBV keyint handling.

Fix bug in VBV lookahead + no MB-tree
I-frames need to have VBV lookahead run on them as well.

Add support for frame-accurate parameter changes
Parameter structs can now be passed with individual frames.
The previous method would only change the parameter of what was currently being encoded, which due to delay might be very far from an intended exact frame.
Also add support for changing aspect ratio. Only works in a stream with repeating headers and requires the caller to force an IDR to ensure instant effect.

Fix x264_encoder_reconfig with multithreading
New behavior: reconfigging the encoder will result in changes being applied
to each of the encoding threads as they finish encoding the current frame.

Fix two bugs in QPRD
QPRD could in some cases force blocks to skip when they shouldn't be (~+0.01db).
Force QPRD to abide by qpmin/qpmax restrictions.

Lookahead VBV
Use the large-scale lookahead capability introduced in MB-tree for ratecontrol purposes.
(Does not require MB-tree, however.)
Greatly improved quality and compliance in 1-pass VBV mode, especially in CBR; +2db OPSNR or more in some cases.
Fix some other bugs in VBV, which should improve non-lookahead mode as well.
Change the tolerance algorithm in row VBV to allow for more significant mispredictions when buffer is nearly full.
Note that due to the fixing of an extremely long-standing bug (>1 year), bitrates may change by nontrivial amounts in CRF without MB-tree.

Fix bug in b-adapt 1
B-adapt 1 didn't use more than MAX(1,bframes-1) B-frames when MB-tree was off.

Fix a potential failure in VBV
If VBV does underflow, ratecontrol could be permanently broken for the rest of the clip.
Revert part of the previous VBV changes to fix this.

new API function x264_encoder_delayed_frames.
fix x264cli on streams whose total length is less than the encoder latency.

Add no-mbtree to fprofile (and fix pyramid in fprofile)

Don't print a warning about direct=auto in 2pass when B-frames are off

fix lowres padding, which failed to extrapolate the right side for some resolutions.
fix a buffer overread in x264_mbtree_propagate_cost_sse2. no effect on actual behavior, only theoretical correctness.
fix x264_slicetype_frame_cost_recalculate on I-frames, which previously used all 0 mb costs.
shut up a valgrind warning in predict_8x8_filter_mmx.

simd part of x264_macroblock_tree_propagate.
1.6x faster on conroe.

MB-tree fixes:
AQ was applied inconsistently, with some AQed costs compared to other non-AQed costs. Strangely enough, fixing this increases SSIM on some sources but decreases it on others. More investigation needed.
Account for weighted bipred.
Reduce memory, increase precision, simplify, and early terminate.

Add missing free()s for new data allocated for MB-tree
Eliminates a memory leak.

Fix keyframe insertion with MB-tree and no B-frames

Fix MP4 output (bug in malloc checking patch)

Gracefully terminate in the case of a malloc failure
Fuzz tests show that all mallocs appear to be checked correctly now.

Fix a potential infinite loop in QPfile parsing on Windows
ftell doesn't seem to work properly on Windows in text mode.

Fix delay calculation with multiple threads
Delay frames for threading don't actually count as part of lookahead.

Add "veryslow" preset
Apparently some people are actually *using* placebo, so I've added this preset to bridge the gap.

Macroblock-tree ratecontrol
On by default; can be turned off with --no-mbtree.
Uses a large lookahead to track temporal propagation of data and weight quality accordingly.
Requires a very large separate statsfile (2 bytes per macroblock) in multi-pass mode.
Doesn't work with b-pyramid yet.
Note that MB-tree inherently measures quality differently from the standard qcomp method, so bitrates produced by CRF may change somewhat.
This makes the "medium" preset a bit slower. Accordingly, make "fast" slower as well, and introduce a new preset "faster" between "fast" and "veryfast".
All presets "fast" and above will have MB-tree on.
Add a new option, --rc-lookahead, to control the distance MB tree looks ahead to perform propagation analysis.
Default is 40; larger values will be slower and require more memory but give more accurate results.
This value will be used in the future to control ratecontrol lookahead (VBV).
Add a new option, --no-psy, to disable all psy optimizations that don't improve PSNR or SSIM.
This disables psy-RD/trellis, but also other more subtle internal psy optimizations that can't be controlled directly via external parameters.
Quality improvement from MB-tree is about 2-70% depending on content.
Strength of MB-tree adjustments can be tweaked using qcompress; higher values mean lower MB-tree strength.
Note that MB-tree may perform slightly suboptimally on fades; this will be fixed by weighted prediction, which is coming soon.
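To give a feel for how large the separate MB-tree statsfile mentioned above can get, here is a rough size estimate. It assumes 16x16 macroblocks, that dimensions are rounded up to whole macroblocks, and that every frame contributes 2 bytes per macroblock; the function name is hypothetical and the real file may differ.

```python
def mbtree_stats_bytes(width, height, frames, bytes_per_mb=2):
    """Estimate the MB-tree statsfile size for a clip."""
    mb_w = (width + 15) // 16    # macroblock columns, rounded up
    mb_h = (height + 15) // 16   # macroblock rows, rounded up
    return mb_w * mb_h * bytes_per_mb * frames


per_frame = mbtree_stats_bytes(1920, 1080, 1)          # one 1080p frame
one_hour = mbtree_stats_bytes(1920, 1080, 24 * 3600)   # one hour at 24 fps
```

Under these assumptions a 1080p frame costs 16320 bytes, so an hour of 24 fps video comes to roughly 1.4 GB.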

Various 1-pass VBV tweaks
Make predictors have an offset in addition to a multiplier.
This primarily fixes issues in sources with lots of extremely static scenes, such as anime and CGI.
We tried linear regressions, but they were very unreliable as predictors.
Also allow VBV to be slightly more aggressive in raising QPs to avoid not having enough bits left in some situations.
Up to 1db improvement on some clips.
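The effect of adding an offset to a multiplier-only predictor can be sketched as follows. This is a simplified model with hypothetical names and numbers, not the actual ratecontrol code: it only shows why a pure multiplier breaks down when the complexity measure is close to zero, as in very static scenes.

```python
def predict_bits(complexity, qscale, coeff, offset=0.0):
    """Predict frame size from a complexity measure and a quantizer scale.

    With offset == 0 this is a pure multiplier model; a nonzero offset
    accounts for bits that are spent regardless of measured complexity.
    """
    return (coeff * complexity + offset) / qscale


# A nearly static frame: measured complexity is zero.
multiplier_only = predict_bits(0.0, qscale=1.0, coeff=2.0)
with_offset = predict_bits(0.0, qscale=1.0, coeff=2.0, offset=50.0)
```

The multiplier-only model predicts zero bits for the static frame, while the offset version still predicts the fixed overhead.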

Fix another 10L in QPRD
An entry in subpel_iterations was missing.
I have no idea how QPRD was working at all without this change.

Update help and cleanup in ratecontrol.c
Deal with some out-of-date information.

15% faster refine_bidir_satd, 10% faster refine_bidir_rd (or less with trellis=2)
re-roll a loop (saves 44KB code size, which is the cause of most of this speed gain)
don't re-mc mvs that haven't changed

Faster bidir_rd plus some bugfixes
Cache chroma MC during refine_bidir_rd and use both the luma and chroma caches to skip MC in macroblock_encode.
Fix incorrect call to rd_cost_part; refine_bidir_rd output was incorrect for i8>0.
Remove some redundant clips.
~12% faster refine_bidir_rd.

Add "fastdecode" tune option
It does what it says it does.

Fix two bugs in QPRD
fprofile settings now actually fprofile QPRD.
Don't use i_mbrd before initializing it.

Fix 10l in QPRD
Trellis used wrong lambda with trellis=1

Fix a nondeterminism with threads and subme>7
Also add a few more checks to eliminate the need for spel_border.

Add QPRD support as subme=10
Refactor trellis lambda selection to be done in analyse_init instead of in trellis.
This will allow for easier adaptation of lambda later on; for now it allows constant lambda across variable QPs.
QPRD is only available with adaptive quantization enabled and generally improves SSIM and visual quality.
Additionally, weight the SSD values from RD based on the relative QP offset for chroma; helps visually at high QPs where chroma has a lower QP than luma.
This fixes some visual artifacts created by QPRD at high QPs.
Note that this generally hurts PSNR and SSIM, and so is only on when psy-RD is on.

SSSE3 cachesplit workaround for avg2_w16
Palignr-based solution for the most commonly used qpel function.
1-1.5% faster overall on Core 2 chips.

shut up valgrind warnings in trellis

New AQ algorithm option
"Auto-variance" uses log(var)^2 instead of log(var) and attempts to adapt strength per-frame.
Generates significantly better SSIM; on by default with --tune ssim.
Whether it generates visually better quality is still up for debate.
Available as --aq-mode 2.

Cacheline-split SSSE3 chroma MC
~70% faster chroma MC on 32-bit Conroe
Also slightly faster SSSE3 intra_sad_8x8c

Improve documentation of qp/crf options

Merge array_non_zero into zigzag_sub
Faster lossless, cleaner code.
SSSE3 version of zigzag_sub_4x4_field, faster lossless interlaced coding.

Fix bug in reference frame autoadjustment
For some types of input file, x264 did the adjustment before width/height were known.

Fix fprofile settings to match changes in defaults
Also add b-adapt 2 to fprofile.

Slightly faster dequant_flat assembly
Eliminate some redundant shifts.

Totally new preset system for x264.c (not libx264), new defaults
Other new features include "tune" and "profile" settings; see --help for more details.
Unlike most other settings, "preset" and "tune" act before all other options.
However, "profile" acts afterwards, overriding all other options.
Our defaults have also changed: new defaults are --subme 7 --bframes 3 --8x8dct --no-psnr --no-ssim --threads auto --ref 3 --mixed-refs --trellis 1 --weightb --crf 23 --progress.
Users will hopefully find these changes to greatly improve usability.
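The ordering described above (preset and tune first, then the other options, then profile last) can be sketched as a chain of dictionary updates. The default values come from the list above; the preset, tune, and profile tables are illustrative assumptions only and do not reproduce the real x264 tables.

```python
DEFAULTS = {"subme": 7, "bframes": 3, "8x8dct": True,
            "ref": 3, "trellis": 1, "crf": 23}

# Hypothetical tables, for illustration only.
PRESETS = {"medium": {}, "fast": {"subme": 6, "ref": 2}}
TUNES = {None: {}, "example": {"bframes": 5}}
PROFILE_LIMITS = {None: {}, "baseline": {"bframes": 0, "8x8dct": False}}


def resolve(preset="medium", tune=None, profile=None, **explicit):
    """Apply settings in order: defaults, preset, tune, explicit, profile."""
    opts = dict(DEFAULTS)
    opts.update(PRESETS[preset])          # preset acts first
    opts.update(TUNES[tune])              # then tune
    opts.update(explicit)                 # explicit options override both
    opts.update(PROFILE_LIMITS[profile])  # profile overrides everything
    return opts


fast_with_ref = resolve(preset="fast", ref=5)
baseline = resolve(bframes=8, profile="baseline")
```

An explicit `ref=5` wins over the preset, while an explicit `bframes=8` is still overridden by the profile restriction.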

Update Gabriel's email address in AUTHORS

Early termination for chroma encoding
Faster chroma encoding by terminating early if heuristics indicate that the block will be DC-only.
This works because the vast majority of inter chroma blocks have no coefficients at all, and those that do are almost always DC-only.
Add two new helper DSP functions for this: dct_dc_8x8 and var2_8x8. mmx/sse2/ssse3 versions of each.
Early termination is disabled at very low QPs due to it not being useful there.
Performance increase is ~1-2% without trellis, up to 5-6% with trellis=2.
Increase is greater with lower bitrates.

Fix bug in checkasm
frame_init_lowres_core check didn't check the C plane.
However, all x86 and PPC assembly was correct regardless of the unit test being incorrect.

Add subpartition cost for sub-8x8 blocks
Improves sub-p8x8 mode decision.

Yet more CABAC and CAVLC optimizations
Also clean up a lot of pointless code duplication in CAVLC MV coding.

Various CABAC optimizations and cleanups
Faster CABAC CBF context calculation for inter blocks.
Add x264_constant_p(), will probably be useful in the future as well.
Simpler subpartition functions.
Clean up and optimize mvd_cpn a bit more.
Various other minor optimizations.

AltiVec version of frame_init_lowres_core. 22.4x faster than C on PPC7450 and 25x on PPC970MP.

MMX CABAC mvd sum calculation
Faster CABAC mvd coding.

Faster MV prediction
Smaller code size, plus I get to use goto.

Fix potential crash in checkasm
ssim_end4_sse2 requires aligned sums

SSSE3, faster SSE2/MMX integral_init4v
The real reason I wrote this was an excuse to use shufpd.

configure check for uclinux

fix a crash on frame width <= 48 pixels

configure check for cc, rather than reporting lack of compiler as an asm error.
configure check for -mno-cygwin, since it's removed from gcc4.

a better way to keep track of mv candidates.
2-4% faster dia, hex, and umh.

reorder some motion estimation patterns.
this change is useless on its own, but segregates the bitstream-changing part out of my next optimization.

Fix VBV warning broken in r915
x264 will now correctly warn about maxrate specified without bufsize even when a level is not set.

configure check for ssse3-capable binutils

Fix 10L in r1155
Broke --me esa/tesa due to forgetting to add handling for x264_cost_mv_fpel.

Fix bug where satd was incorrectly used with subme<=1
Faster subme<=1 with i4x4 enabled.

Remove some pointless error handling code in cabac/cavlc

Save some memory on mv cost arrays
Have quantizers that use the same lambda share the same cost array.

Various CABAC and CAVLC optimizations
Backport CAVLC partial-inlining early termination to CABAC (~2-4% faster CABAC residual coding)

fix a race condition at the end of thread_input

Various trellis speed optimizations

Make i686 the default arch on x86_32
Disabling asm will default to a generic arch.
Also fix configure for gcc 4.4.

Faster signed golomb coding
3% faster CAVLC RDO and bitstream writing.
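For readers unfamiliar with the code being written here, this is a plain reference sketch of signed Exp-Golomb coding as defined for H.264 se(v) syntax elements. It builds a bit string for clarity and is not the optimized routine the commit refers to; the function names are hypothetical.

```python
def ue_bits(code_num):
    """Unsigned Exp-Golomb codeword for code_num >= 0, as a bit string."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    binary = bin(code_num + 1)[2:]
    # Prefix of (len - 1) zeros, then the binary form of code_num + 1.
    return "0" * (len(binary) - 1) + binary


def se_bits(value):
    """Signed Exp-Golomb: positive v maps to 2v - 1, non-positive v to -2v."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue_bits(code_num)


codes = {v: se_bits(v) for v in (0, 1, -1, 2, -2)}
```

Small magnitudes get short codewords, which is why the speed of this routine matters for motion vector and residual coding.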

Faster spatial direct MV prediction
unroll/tweak col_zero_flag

More CABAC and CAVLC optimizations
Simplified function calling for block_residual_write_(cabac|cavlc) and improved sigmap coding.
Tried making 0/1-bit specific versions of CABAC asm, but benefit was minimal under GCC 4.3.
Helped a decent bit under 3.4, but you shouldn't be using such old versions anyways.

Various optimizations in frametype lookahead

Some cosmetics/cleanup
Move some macros to x86util.asm that should have been there to begin with.
Fix a typo that didn't cause any issues.

fix "incompatible types in initialization" compilation issues with GCC 4.3 (which is stricter than previous compiler version)

fix "conversions between vectors with differing element types or numbers of subparts" errors

Add "coded blocks" stat to output information.
This measures the total percentage of blocks, intra and inter, which have nonzero coefficients.
"y,uvDC,uvAC" refers to luma, chroma DC, and chroma AC blocks.
Note that skip blocks are included in this stat.

Enable asm predict_8x8_filter
I'm not entirely sure how this snuck its way out of holger's intra pred patch.

Remove various bits of dead code found by CLANG.

Slightly faster SSE4 SA8D, SSE4 Hadamard_AC, SSE2 SSIM
shufps is the most underrated SSE instruction on x86.

Various CABAC optimizations
Move calculation of b_intra out of the core residual loop and hardcode it where applicable.
Inlining cabac_mb_mvd was unnecessary and wasted tremendous amounts of code size. Inlining only cache_mvd is faster and significantly smaller.

CAVLC optimizations
faster bs_write_te, port CABAC context selection optimization to CAVLC.

Since the bypass case is quite unlikely, especially when doing merged sigmap/level coding,
it's faster to use a branch than a cmov.

Activate intra_sad_x3_8x8c in lookahead

MBAFF interlaced coding is not allowed in baseline profile

intra_sad_x3_8x8 assembly

intra_sad_x3_4x4 assembly

intra_sad_x3_8x8c assembly
Also fix intra_sad_x3_16x16's use of "n" as a loop variable (broke SWAP)

Shave one instruction off CABAC encode_decision
range_lps>>6 ranges from 4-7, so (range_lps>>6)-4 == (range_lps>>6) & 3
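The identity above is easy to check exhaustively. The sketch below assumes only what the commit states, namely that the shifted value is always in 4..7 (that is, the input lies in 256..511); the function names are hypothetical.

```python
def index_by_subtract(range_lps):
    """Table index computed with a subtraction."""
    return (range_lps >> 6) - 4


def index_by_mask(range_lps):
    """Same index computed with a mask, valid while range_lps >> 6 is in 4..7."""
    return (range_lps >> 6) & 3


identity_holds = all(index_by_subtract(r) == index_by_mask(r)
                     for r in range(256, 512))
```

Outside that interval the two forms diverge (for example at 512, where the subtraction gives 4 and the mask gives 0), so the trick depends on the range invariant.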

Faster probe_skip
Add a second chroma threshold after the DC transform.

Add missing "static" qualifier to two arrays
Should slightly improve performance.

SSE2 zigzag_interleave
Replace PHADD with FastShuffle (more accurate naming).
This flag represents asm functions that rely on fast SSE2 shuffle units, and thus are only faster on Phenom, Nehalem, and Penryn CPUs.

Faster integral_init
palignr to avoid unaligned loads is worth it in inith, but not initv.

Faster SSSE3 hpel_filter_v
~10% faster hpel_filter on 64-bit Penryn.
32-bit version by Jason Garrett-Glaser.

Faster SSE2 pixel_var
Optimized using the DEINTB method from r1122. ~32% faster var_16x16 on Conroe.

SSSE3 hpel_filter_v
Optimized using the same method as in r1122. Patch partially by Holger.
~8% faster hpel filter on 64-bit Nehalem

Update some asm copyright headers

Vastly faster SATD/SA8D/Hadamard_AC/SSD/DCT/IDCT
Heavily optimized for Core 2 and Nehalem, but performance should improve on all modern x86 CPUs.
16x16 SATD: +18% speed on K8(64bit), +22% on K10(32bit), +42% on Penryn(64bit), +44% on Nehalem(64bit), +50% on P4(32bit), +98% on Conroe(64bit)
Similar performance boosts in SATD-like functions (SA8D, hadamard_ac) and somewhat less in DCT/IDCT/SSD.
Overall performance boost is up to ~15% on 64-bit Conroe.

Update x264 copyright date

Remove pre-scenecut from fprofile commands as well
Also add psy-trellis to fprofile

Slightly faster 8x16 SAD on Penryn Core 2
Same as MMX 8x16 cacheline SAD, but calls SSE2 8x16 SAD in non-cacheline case.
Only Nehalem benefits from sizes smaller than 8x16, and Nehalem doesn't use cacheline functions, so no smaller versions are included.

Fix scenecut and VBV with videos of width/height <= 32
Also remove an unused variable

Remove non-pre scenecut
Add support for no-b-adapt + pre-scenecut (patch by BugMaster)
Pre-scenecut was generally better than regular scenecut in terms of accuracy and regular scenecut didn't work in threaded mode anyways.
Add no-scenecut option (scenecut=0 is now no scenecut; previously it was -1)
Fix an incorrect bias towards P-frames near scenecuts with B-adapt 2.
Simplify pre-scenecut code.

Add AltiVec version of hadamard_ac. 2.4x faster than the C version.
Note that this implementation is pretty naive and should be improved
by implementing what's discussed in this ML thread:
date: Mon, Feb 2, 2009 at 6:58 PM
subject: Re: [x264-devel] [PATCH] AltiVec implementation of hadamard_ac routines

Fix regression in r1085
Deblocking was very slightly incorrect with partitions=all.
Bug found by BugMaster.

Optimize neighbor CBP calculation and fix related regression
r1105 introduced array overflow in cbp handling

Show FPS when importing a raw YUV file

Windows 64-bit support
A "make distclean" is probably required after updating to this revision.

Minor fixes and cosmetics
Suppress a GCC warning, fix a non-problematic array overflow, one REP->REP_RET.

fix 10l in 75b495f2723fcb77f
Original thread:
date: Mon, Feb 9, 2009 at 9:37 PM
commit: Spare a vec_perm and a vec_mergeh through using a LUT of permutation vectors. (Guillaume Poirier)

Date: Mon Feb 9 21:17:33 2009 +0100

Spare a vec_perm and a vec_mergeh through using a LUT of permutation vectors.

Promote chroma planes to 16 byte alignment.
This will allow simplifying vector loads that can only load 16-byte-aligned
data (such as AltiVec).

Fix 10L in intra pred
Forgetting a %define resulted in SIGILL on 32-bit systems without SSE (e.g. Athlon XP).

Add decimation in i16x16 blocks
Up to +0.04db with CAVLC, generally a lot less with CABAC.

Much faster CABAC residual context selection
Up to ~17% faster CABAC RDO, ~36% faster intra-only CABAC RDO.
Up to 7% faster overall in extreme cases.

Faster coeff_last64 on 32-bit

More intra pred asm optimizations
SSSE3 version of predict_8x8_hu
SSE2 version of predict_8x8c_p
SSSE3 versions of both planar prediction functions
Optimizations to predict_16x16_p_sse2
Some unnecessary REP_RETs -> RETs.
SSE2 version of predict_8x8_vr by Holger.
SSE2 version of predict_8x8_hd.
Don't compile MMX versions of some of the pred functions on x86_64.
Remove now-useless x86_64 C versions of 4x4 pred functions.
Rewrite some of the x86_64-only C functions in asm.

Speed-up mc_chroma_altivec by using vec_mladd cleverly, and unrolling.
Also put width == 2 variant in its own scalar function because it's faster
than a vectorized one.

Merging Holger's GSOC branch part 2: intra prediction
Assembly versions of most remaining 4x4 and 8x8 intra pred functions.
Assembly version of predict_8x8_filter.
A few other optimizations.
Primarily Core 2-optimized.

10l: fix compilation with GCC 4.3+

Faster 8x8dct+CAVLC interleave
Integrate array_non_zero with the CAVLC 8x8dct interleave function.
Roughly 1.5-2x faster than the original separate array_non_zero method.

Measure CBP cost in i8x8 RD refinement
~0.02-0.05db PSNR gain at high quants in intra-only encoding, pretty small otherwise.
Allows a small optimization in i8x8 encoding.

Take advantage of saturated signed horizontal sum instructions in
the variance computation epilogue, since no overflow can occur there
to trigger saturation.
Suggested by Loren Merritt

Massive overhaul of nnz/cbp calculation
Modify quantization to also calculate array_non_zero.
PPC assembly changes by gpoirier.
New quant asm includes some small tweaks to quant and SSE4 versions using ptest for the array_non_zero.
Use this new feature of quant to merge nnz/cbp calculation directly with encoding and avoid many unnecessary calls to dequant/zigzag/decimate/etc.
Also add new i16x16 DC-only iDCT with asm.
Since intra encoding now directly calculates nnz, skip_intra now backs up nnz/cbp as well.
Output should be equivalent except when using p4x4+RDO because of a subtlety involving old nnz values lying around.
Performance increase in macroblock_encode: ~18% with dct-decimate, 30% without at CRF 25.
Overall performance increase 0-6% depending on encoding settings.

Add PowerPC support for "checkasm --bench", reading the time base register.
This isn't ideal since the `time base' register is running at a fraction
of the processor cycle speed, so the measurement isn't as precise as x86's.
It's better than nothing though...

fix detection of pthread and isfinite on OpenBSD

remove $ECHON kludge, which broke on SunOS. bring back `gcc -MT`.
remove auto-reconfigure on svn update, which has done nothing since we stopped using svn.
fix $AS on sparc (was disabled by mmx check).
fix --extra-asflags (was ignored).
mark bash scripts as bash, not sh
patch partly by Greg Robinson and Jugdish.

1.6x faster satd_c (and sa8d and hadamard_ac) with pseudo-simd.
60KB smaller binary.

Hack around a potential failure point in VBV
pred_b_from_p can become absurdly large in static scenes, leading to rare collapses of quality with VBV+B-frames+threads.
This isn't a final fix, but should resolve the problem in most cases in the meantime.

Much faster chroma encoding and other opts
~15% faster chroma encode by reorganizing CBP calculation and adding special-case idct_dc function, since most coded chroma blocks are DC-only.
Small optimization in cache_save (skip_bp)
Fix array_non_zero to not violate strict aliasing (should eliminate miscompilation issues in the future)
Add in automatic substitutions for some asm instructions that have an equivalent smaller representation.

add AltiVec implementation of x264_mc_copy_w16_aligned

add AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8

add AltiVec 16 <-> 32 bits conversions macros

Replace 16x16=>32 mul + pack + add by a simple 16x16=>16 multiply-add.
Suggested by Loren.

Eliminate support for direct_8x8_inference=0
The benefit in the most extreme contrived situation was at most 0.001db PSNR, at the cost of slower decoding.
As this option was basically useless, it was a waste of code and prevented some other useful optimizations.
Remove some unused mc code related to sub-8x8 partitions.
Small deblocking speedup when p4x4 is used.
Also remove unused x264_nal_decode prototype from x264.h.

Add AltiVec and CPU numbers detection on OpenBSD.

Add AltiVec implementation of predict_8x8c_p. 2.6x faster than scalar C.

Warn if direct auto wasn't set on the first pass
And, if it wasn't, run direct auto as if it was the first pass, rather than simply forcing temporal direct mode on all frames.
Also a small tweak to coeff_level_run asm.

Changes the PowerPC ppccommon.h header so it no longer checks for a particular
OS such as Linux but instead looks for HAVE_ALTIVEC_H being set.
Fixes all *BSD/PowerPC builds.

update x264_hpel_filter_altivec's prototype to match the one of the C version.
in commit 045ae4045a1827555b3eaab4fbf3c9809e98c58f (factorization of mallocs)

Author: Guillaume Poirier <>
Date: Wed Jan 14 21:49:42 2009 +0100

rename vector+array unions to more closely match the vector typedef names.

Add Altivec implementation of all the remaining 16x16 predict routines.

Cache ref costs and use more accurate MV costs
New MV costs should improve quality slightly by improving the smoothness of the field of MV costs (and they're closer to CABAC's actual costs).
Despite being optimized for CABAC, they still help under CAVLC, albeit less.
MV cost change by Loren Merritt

Support forced frametypes with scenecut/b-adapt
This allows an input qpfile to be used to force I-frames, for example.
The same can be done through the library interface.
Document the format of the qpfile in --longhelp and the forcing of frametypes in x264.h
Note that forcing B-frames and B-refs may not always have the intended result.
Patch partially by Steven Walters <>.

Remove an IDIV from i8x8 analysis
Only one IDIV is left in macroblock level code (transform_rd)

Fix regression in r1066
With some combinations of video width and other settings, the scratch buffer was slightly too small.
This caused heap corruption on some systems.
Also prevent merange from being raised during encoding with esa/tesa through encoder_reconfig, as this no longer works.

Disable B-frames in lossless mode
They hurt compression anyways, and direct auto was bugged with lossless.

Factorize in ppccommon.h the conditional inclusion of altivec.h on Linux systems.

Disable __builtin_clz() intrinsic on gcc versions prior to 3.4.
The function did not exist before that version.

Small tweaks to coeff asm
Factor out a few redundant pxors
Related cosmetics

Use the correct strtok under MSVC
Also change one malloc -> x264_malloc

Add stack alignment for lookahead functions
Should allow libx264 to be called from non-gcc-compiled applications without adding force_align_arg_pointer.

Add support for SSE4a (Phenom) LZCNT instruction
Significantly speeds up coeff_last and coeff_level_run on Phenom CPUs for faster CAVLC and CABAC.
Also a small tweak to coeff_level_run asm.

factor mallocs out of hpel, ssim, and esa.
there should now be no memory allocation outside of init-time.

Much faster CAVLC RDO and bitstream writing
Pure asm version of level/run coding. Over 2x faster than C.
Up to 40% faster CAVLC RDO. Overall benefit up to ~7.5% with RDO or ~5% with fast encoding settings.

Cosmetics: cleaner syntax for defining temporary registers in asm
Globally define t#[qdwb], so that only t# needs to be locally defined when reorganizing registers

Much faster CABAC RDO
Since RDO doesn't care about what order bit costs are calculated, merge sigmap and level coding into the same loop in RDO.
This is bit-exact for 4x4dct but slightly incorrect for 8x8dct due to the sigmap containing duplicated contexts.
However, the PSNR penalty of this is extremely small (~0.001db).
Speed benefit is about 15% in 4x4dct and 30% in 8x8dct residual bit cost calculation at QP20.
Overall encoding speed benefit is up to 5%, depending on encoding settings.
Also remove an old unnecessary CABAC table that hasn't been used for years.

VLC table optimizations
Slightly reorganize VLC tables for ~2% faster block_residual_write_cavlc.
Also a small optimization in p8x8 CAVLC.

Fix crash in --me esa/tesa introduced in r1058
Also suppress the last mingw warning message

Optimize variance asm + minor changes
Remove SAD argument from var, not needed anymore.
Speed up var asm a bit by eliminating psadbw and instead HADDWing at end.
Eliminate all remaining warnings on gcc 3.4 on cygwin
Port another minor optimization from lavc (pskip)

Minor CABAC cleanups and related optimizations
Merge the two list tables to allow cleaner MC/CABAC/CAVLC code
Remove lots of unnecessary {s
Port some very minor opts from lavc

faster ESA init
reduce memory if using ESA and not p4x4

More macroblock_cache optimizations
Patch partially by Loren Merritt

Faster macroblock_cache_rect
Explicit loop unrolling

Optimizations in predict_mv_direct
Add some early terminations and minor optimizations
This change may also fix the extremely rare direct+threading MV bug.

Fix visual corruption when picture width was not mod 32.
The previous Altivec implementation of mc_chroma assumed that i_src_stride was always mod 16.

Add support for FSF GCC version >= 4.3 on OSX.
So far, only Apple GCC version was supported.

More accurate refcost for p8x8 CAVLC
Slightly better quality, especially in non-RD mode, with CAVLC.

use lookup tables instead of actual exp/pow for AQ
Significant speed boost, especially on CPUs with atrociously slow floating point units (e.g. Pentium 4 saves 800 clocks per MB with this change).
Add x264_clz function as part of the LUT system: this may be useful later.
Note this changes output somewhat as the numbers from the lookup table are not exact.
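The trade-off described above (a table lookup instead of a real exp/pow call, at the cost of exactness) can be sketched like this. The 64-entry table and the rounding scheme are assumptions chosen for illustration; they are not claimed to match the tables x264 actually uses.

```python
import math

FRAC_BITS = 6  # assumed table resolution: 64 entries per octave
EXP2_LUT = [2.0 ** (i / (1 << FRAC_BITS)) for i in range(1 << FRAC_BITS)]


def exp2_lut(x):
    """Approximate 2**x by rounding x to the nearest 1/64 and using a table."""
    fixed = math.floor(x * (1 << FRAC_BITS) + 0.5)   # x in 1/64 units
    whole = fixed >> FRAC_BITS                       # integer part (floor)
    frac = fixed & ((1 << FRAC_BITS) - 1)            # table index, 0..63
    return math.ldexp(EXP2_LUT[frac], whole)         # table value * 2**whole


# Worst relative error over a sample of inputs in [-3, 3).
worst_error = max(abs(exp2_lut(i / 100) / 2.0 ** (i / 100) - 1.0)
                  for i in range(-300, 300))
```

With this resolution the result stays within about 0.6% of the exact value, which illustrates why output changes slightly when a table replaces the exact function.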

Suppress saveptr warnings on Windows GCC

More small speed tweaks to macroblock.c

Much faster CAVLC residual coding
Use a VLC table for common levelcodes instead of constructing them on-the-spot
Branchless version of i_trailing calculation (2x faster on Nehalem)
Completely remove array_non_zero_count and instead use the count calculated in level/run coding. Note: this slightly changes output with subme > 7 due to different nonzero counts being stored during qpel RD.

fix compilation with GCC-4.3+

High Profile allows 25% higher maxbitrate/cpb
Correct level detection to take this into account.
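As a worked example of the 25% rule, the sketch below scales base level limits by 5/4 for High Profile. The base figures (in kbit/s and kbit) are quoted from memory of the H.264 level table and should be checked against the specification; the function and table names are hypothetical.

```python
# Base (Baseline/Main) limits per level: (max bitrate, max CPB size).
# Values as recalled from the H.264 level table; verify before relying on them.
BASE_LIMITS = {"3.1": (14000, 14000), "4.0": (20000, 25000), "4.1": (50000, 62500)}


def level_limits(level, profile):
    """Return (max_bitrate, max_cpb) for a level, scaled for High Profile."""
    max_br, max_cpb = BASE_LIMITS[level]
    if profile == "high":
        # High Profile allows 25% more than the base limits.
        return max_br * 5 // 4, max_cpb * 5 // 4
    return max_br, max_cpb


main_40 = level_limits("4.0", "main")
high_40 = level_limits("4.0", "high")
```

A stream at 24000 kbit/s would exceed level 4.0 in Main Profile but fit within it in High Profile, which is the case level detection has to get right.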

s/nasm/yasm in VS project file

Cosmetic: update various file headers.

add date and compiler to `x264 --version`

10L in r1041

Significantly faster CABAC and CAVLC residual coding and bit cost calculation
Early-terminate in residual writing using stored nnz counts
To allow the above, store nnz counts for luma and chroma DC
Add assembly functions to find the last nonzero coefficient in a block
Overall ~1.9% faster at subme9+8x8dct+qp25 with CAVLC, ~0.7% faster with CABAC
Note this changes output slightly with CABAC RDO because it requires always storing correct nnz values during RDO, which wasn't done before in cases it wasn't useful.
CAVLC output should be equivalent.

dequant_4x4_dc assembly
About 3.5x faster DC dequant on Conroe

fix an overflow in dct4x4dc_mmx
(unlikely to have occurred in any real video)

Remove nasm support
Nasm won't correctly parse the SSE4 code introduced a few revisions ago, so we're removing support.
Users should upgrade to yasm 0.6.1 or later.

Fix rare warning messages in ratecontrol due to r1020

Fix MSVC compilation and clean up MSVC build file
Remove Release64 which never worked anyways.

Faster width4 SSD+SATD, SSE4 optimizations
Do satd 4x8 by transposing the two blocks' positions and running satd 8x4.
Use pinsrd (SSE4) for faster width4 SSD
Globally replace movlhps with punpcklqdq (it seems to be faster on Conroe)
Move mask_misalign declaration to cpu.h to avoid warning in encoder.c.
These optimizations help on Nehalem, Phenom, and Penryn CPUs.

fix indentation, whitespace cleanup, more consistent indentation of macro backslashes

Change some macros to be more sensitive to memory alignment, thus avoiding
useless loads/stores and calculations of permutation vectors.
Affected functions are all of mc_luma, mc_chroma, 'get_ref', SATD, SA8D and deblock.
Overall gains vary from ~5% to 15% depending on settings, measured on a 1.42 GHz G4.

refactor satd. 20KB smaller binary.
refactor sa8d. slightly faster.
more checkasm for hadamard.

Fix crash with threads and SSEMisalign on Phenom
Misalign mask needed to be set separately for each encoding thread.

Phenom CPU optimizations
Faster hpel_filter by using unaligned loads instead of emulated PALIGNR
Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it).
Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref.
Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom.
Merge cpu-32.asm and cpu-64.asm
Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.

A few tweaks to decimate asm
A little bit faster on both 32-bit and 64-bit

Nehalem optimization part 2: SSE2 width-8 SAD
Helps a bit on Phenom as well
~25% faster width8 multiSAD on Nehalem

Add subme=0 (fullpel motion estimation only)
Only for experimental purposes and ultra-fast encoding. Probably not a good idea for firstpass.

Fix minor memory leak in r1022

r1024 borked checkasm
Remove idct/dct2x2 from checkasm as they are no longer in dctf

Faster chroma encoding
9-12% faster chroma encode.
Move all functions for handling chroma DC that don't have assembly versions to macroblock.c and inline them, along with a few other tweaks.

Various cosmetics and minor fixes
Disable hadamard_ac sse2/ssse3 under stack_mod4
Fix one MSVC compilation warning
Fix compilation in debug mode in certain cases on x64
Remove eval.c from MSVC project
Fix crash when VBV is used in CQP mode
Patches by MasterNobody

Faster b-adapt + adaptive quantization
Factor out pow to be only called once per macroblock. Speeds up b-adapt, especially b-adapt 2, considerably.
Speed boost is as high as 24% with b-adapt 2 + b-frames 16.

Faster CABAC residual encoding
6% faster block_residual_write_cabac in RD mode.

Fix potential crash in the case that the input statsfile is too short
Also resolve various other potential weirdness (such as multiple copies of the same error message in threaded mode).

Initial Nehalem CPU optimizations
movaps/movups are no longer equivalent to their integer equivalents on the Nehalem, so that substitution is removed.
Nehalem has a much lower cacheline split penalty than previous Intel CPUs, so cacheline workarounds are no longer necessary.
Thanks to Intel for providing Avail Media with the pre-release Nehalem CPU needed to prepare these (and other not-yet-committed) optimizations.
Author: Gabriel Bouvigne <>
Date: Tue Nov 4 09:56:03 2008 -0800
Fix potential infinite loop in VBV under GCC 4.2

Encoder_reconfig: esa/tesa can only be enabled if they were on to begin with
Bug report by kemuri-_9.

Fix bug in hadamard_ac SSE assembly
Some extreme inputs could cause overflows.

Full sub8x8 RD mode decision
Small speed penalty with p4x4 enabled, but significant quality gain at subme >= 6
As before, gain is proportional to the amount of p4x4 actually useful in a given input at the given bitrate.

Optimize CABAC bit cost calculation
Speed up cabac mvd and add new precalculated transition/entropy table.
Add "noup" function for cabac operations to not update the state table when it isn't necessary.
1-3% faster macroblock_size_cabac.

Replace "git-command" with "git command" in for git 1.6 support

Add assembly version of CAVLC 8x8dct interleave
Faster CAVLC encoding and RDO with 8x8dct

Add support for psy-rd/trellis to encoder_reconfig

Fix Darwin speed regression

Further improve prediction of bitrate and VBV in threaded mode

Sub-8x8 Qpel-RD in P-frames
Improves quality when using p8x4/p4x8/p4x4 subpartitions
Benefit is proportional to how many sub-8x8 partitions are used; helps most at high bitrates and low resolutions.

Faster qpel-RD
3-4% faster qpel-RD; avoid re-checking bmv/pmv during the hex search.

Some minor optimizations in RD refinement
Don't write b subpartition in CABAC RDO
Calculate nonzero count in i4x4 CAVLC RDO

Faster deblocking when p4x4 isn't used
Most of the MV checks can be skipped, resulting in faster strength calculation

Print profile and level information upon starting encode
Previously level was only printed as part of autodetect, and only in verbose mode.

Fix possible crash in trellis at very low QPs

Add assembly versions of decimate_score
3-7x faster decimation, 1-3% faster overall

Fix typo in subme8/9 lossless qpel-RD
Slightly improves compression.

Extend trellis to support luma/chroma DC and chroma AC
Small speed loss in trellis 1, slightly larger in trellis 2, but significant quality improvement.

rm gtk, avc2avi.
I don't remember why I allowed a gui into the repository in the first place. There's nothing that makes this one special relative to all the other x264 guis.
avc2avi doesn't compile since we removed the bitstream reader. And avc doesn't belong in avi.

Resolve quality regression in r996
Accidentally removed the wrong line of code. I think this classifies as a "10l".
Thanks to techouse for initial bug report and skystrife for helping me find it.

Fix minor memory leak accidentally added with the addition of b-adapt 2

Rework subme system, add RD refinement in B-frames
The new system is as follows: subme6 is RD in I/P frames, subme7 is RD in all frames, subme8 is RD refinement in I/P frames, and subme9 is RD refinement in all frames.
subme6 == old subme6, subme7 == old subme6+brdo, subme8 == old subme7+brdo, subme9 == no equivalent
--b-rdo has, accordingly, been removed. --bime has also been removed, and instead enabled automatically at subme >= 5.
RD refinement in B-frames (subme9) includes both qpel-RD and an RD version of bime.
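
The levels described above can be summarised as a small lookup. This is an illustrative Python sketch of the commit message only, not x264 code; the function name and dictionary keys are hypothetical, and it assumes the levels are cumulative (each level keeps what the lower ones enable), as the old-to-new mapping suggests.

```python
def subme_features(subme):
    """Return the RD-related features enabled at a given --subme level,
    following the description in this commit message (hypothetical helper)."""
    if not 1 <= subme <= 9:
        raise ValueError("subme must be in 1..9 for this sketch")
    return {
        "bime": subme >= 5,                 # enabled automatically, --bime removed
        "rd_ip_frames": subme >= 6,         # RD in I/P frames
        "rd_b_frames": subme >= 7,          # RD in all frames (old --b-rdo)
        "rd_refine_ip_frames": subme >= 8,  # RD refinement in I/P frames
        "rd_refine_b_frames": subme >= 9,   # RD refinement in all frames
    }


if __name__ == "__main__":
    for level in (5, 6, 7, 8, 9):
        enabled = [name for name, on in subme_features(level).items() if on]
        print(level, enabled)
```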

Fix potential miscompilation of some inline asm
Caused problems under some gcc 4.x versions with predictive lossless

Replace High 4:4:4 profile lossless with High 4:4:4 Predictive.
This improves lossless compression by about 4-25% depending on source.
The benefit is generally higher for intra-only compression.
Also add support for 8x8dct and i8x8 blocks in lossless mode; this improves compression very slightly.
In some rare cases 8x8dct can hurt compression in lossless mode, but it's usually helpful, albeit marginally.
Note that 8x8dct is only available with CABAC as it is never useful with CAVLC.
High 4:4:4 Predictive replaced the previous profile in a 2007 revision to the H.264 standard.
The only known compliant decoder for this profile is the latest version of CoreAVC.
As I write this, JM does not actually correctly decode this profile.
Hopefully this lack of support will soon change with this commit, as x264 will be (to my knowledge) the first compliant encoder.
Date: Fri Sep 26 09:19:56 2008 -0700
Fix typo in progress indicator when using piped input


fix bitstream writer on bigendian 64bit (regression in r903)

remove authors whose code no longer exists

more diagnostics when configure finds an unsuitable assembler

Make x264 progress indicator more concise
Now the % indicator should be readable on the header of a minimized window on Windows systems.

Fix deblocking + threads + AQ bug
At low QPs, with threads and deblocking on, deblocking could be improperly disabled.
Revision in which this bug was introduced is unknown; it may be as old as b_variable_qp in x264 itself.

Resolve possible crash in bime, improve the fix in r985

Fix rare crash issue in b-adapt
Regression *probably* in r979

Merging Holger's GSOC branch part 1: hpel_filter speedups

r980 borked weighted bime

Disable I_PCM with psy-RD
psy-RD seems to put the PCM threshold a bit lower than it should be, so PCM is now disabled under psy-RD.

Merge avg and avg_weight
avg_weight no longer has to be special-cased in the code; faster weightb

Rewrite avg/avg_weight to take two source pointers
This allows the use of get_ref instead of mc_luma almost everywhere for bipred

Use low-resolution lookahead motion vectors as an extra predictor
Improves quality considerably (0-5%) in 1pass/CRF mode, especially with lower --me values and complex motion.
Reverses the order of lowres lookahead search to improve the usefulness of the extra predictors.

Add missing free() for f_qp_offset in frame.c

Correct misprediction of bitrate in threaded mode
Improves bitrate accuracy in cases with large numbers of threads.
Loosely based on a patch by BugMaster.

Fix a case in which VBV underflows can occur
Fix a potential case where a frame might be initially allocated too low a QP, which would then have to be raised a lot during row-based ratecontrol.
In some cases, this could even produce VBV underflows in 2pass mode.

Use correct format specifier for uint64_t

Cache motion vectors in lowres lookahead
This vastly speeds up b-adapt 2, especially at large bframes values.
This changes output because now MV prediction in lookahead only uses L0/L1 MVs, not bidir. This isn't a problem, since the bidir prediction wasn't really correct to begin with, so the change in output is neither positive nor negative.
This also allowed the removal of some unnecessary memsets, which should also give a small speed boost.
Finally, this allows the use of the lowres motion vectors for predictors in some future patch.

Fix regression in b-adapt patch: encoder_open failed for multipass encodes without bframes.

Stop SAR in y4m input from overriding --sar on commandline

hadamard_ac for psy-rd
c version is 1.7x faster than satd+sa8d+sad
ssse3 version is 2.3x faster than satd+sa8d+sad

Psychovisually optimized rate-distortion optimization and trellis
The latter, psy-trellis, is disabled by default and is reserved as experimental; your mileage may vary.
Default subme is raised to 6 so that psy RD is on by default.

Add optional more optimal B-frame decision method
This method (--b-adapt 2) uses a Viterbi algorithm somewhat similar to that used in trellis quantization.
Note that it is not fully optimized and is very slow with large --bframes values.
It also takes into account weightb, which should improve fade detection.
Additionally, changes were made to cache lowres intra results for each frame to avoid recalculating them. This should improve performance in both B-frame decision methods.
This can also be done for motion vectors, which will dramatically improve b-adapt 2 performance when it is complete.
This patch also reads b_adapt and scenecut settings from the first pass so that the x264 header information in the output file will have correct information (since frametype decision is only done on the first pass).

Move adaptive quantization to before ratecontrol, eliminate qcomp bias
This change improves VBV accuracy and improves bit distribution in CRF and 2pass.
Instead of being applied after ratecontrol, AQ becomes part of the complexity measure that ratecontrol uses.
This allows for modularity for changes to AQ; a new AQ algorithm can be introduced simply by introducing a new aq_mode and a corresponding if in adaptive_quant_frame.
This also allows quantizer field smoothing, since quantizers are calculated beforehand rather than during encoding.
Since there is no more reason for it, aq_mode 1 is removed. The new mode 1 is in a sense a merger of the old modes 1 and 2.
WARNING: This change redefines CRF when using AQ, so output bitrate for a given CRF may be significantly different from before this change!

Fix crash when using b-adapt at resolutions 32x32 or below.
Original patch by BugMaster, but was mostly rewritten in order to make b-adapt actually *work* at such resolutions, not merely stop crashing.

Add title-bar progress indicator under WIN32
Also add bitrate-so-far output when piping data to x264 (total frames not known)
Patch mostly by recover from Doom9.

Revert part of r963
In some rare (but significant) cases, the optimized nal_encode algorithm gave incorrect results.

Predict 4x4_DC asm
Also remove 5-year-old unnecessary #define that reduced speed unnecessarily under MSVC-compiled builds

Faster NAL unit encoding and remove unused nal_decode
Small speedup at very high bitrates

CAVLC cleanup and optimizations
Also move some small functions in macroblock.c to a .h file so they can be inlined.

Faster avg_weight assembly
Unrolling the loop a bit improves performance

Faster H asm intra prediction functions
Take advantage of the H prediction method invented for merged intra SAD and apply it to regular prediction, too.

Add merged SAD for i16x16 analysis
Roughly 30% faster i16x16 analysis under subme=1

Add sad_aligned for faster subme=1 mbcmp
Distinguish between unaligned and aligned uses of mbcmp
SAD_aligned, for MMX SADs, uses non-cacheline SADs.

Improve progress indicator
Show average bitrate so far during encoding
Decrease update interval for longer encodes (max of 10 frames encoded between updates)

Fix speed regression in r951
Row SATDs are only necessary in VBV mode, so don't need to be checked if VBV is off.

zigzag asm

fix SOFLAGS used when building gtk frontend
patch by Markus Kanet %darkvision A gmx P eu%

remove the distinction between itex and ptex
(changes 2pass statsfile format)

hardcode the ratecontrol equation, and remove the rceq option

Fix some uses of uninitialized row_satd values in VBV
Resolves some issues with QP51 in I-frames with scenecut

Activate trellis in p8x8 qpel RD
Also clean up macroblock.c with some refactoring
Note that this change significantly reduces subme7+trellis2 performance, but improves quality.
Issue originally reported by Alex_W.

Improve VBV accuracy
Don't use the previous frame's row SATD as a predictor if it is too different from this frame's row SATD.

improve generation of Darwin libraries
Patch by vmrsss %vmrsss A gmail P com%

Fix compilation in gcc 3.4.x (issue in r946)
Due to a bug in gcc 3.4.x, in certain cases of inlining, the array_non_zero_int_mmx inline assembly is miscompiled and causes a crash with --subme 7 --8x8dct.
This minor hack fixes this issue.

shut up various gcc warnings

fix a crash with invalid args and --thread-input (introduced in r921)

drop support for x86_32 PIC.

use permute macros in satd
move some more shared macros to x264util.asm


r940 broke threads

Cleanups in macroblock_cache_save/load
A bit more loop unrolling, and moving some constant code to the global init function

Deblocking code cleanup and cosmetics
Convert the style of the deblocking code to the standard x264 style
Eliminate some trailing whitespace

4% faster deblock: special-case macroblock edges
Along with a bit of related code reorganization and macroification

Add dedicated variance function instead of using SAD+SSD
Faster variance calculation
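
For context, block variance can be computed in one pass from the pixel sum and the sum of squared pixels, which is presumably what the earlier SAD+SSD combination provided when run against an all-zero block. The Python sketch below shows only that identity; the function name is hypothetical and the result is left unnormalised (N times the variance), which is an assumption of this example rather than a statement about x264's function.

```python
def block_variance_times_n(pixels):
    """Return N * variance of a block, using sum(x^2) - (sum(x))^2 / N."""
    n = len(pixels)
    total = sum(pixels)                     # what a SAD against zeros yields
    total_sq = sum(p * p for p in pixels)   # what an SSD against zeros yields
    return total_sq - (total * total) // n


if __name__ == "__main__":
    print(block_variance_times_n([1, 2, 3, 4]))
    print(block_variance_times_n([7] * 16))
```

A single routine that accumulates both sums in one loop reads each pixel once instead of twice.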

6% faster deblock: remove some clips, earlier termination on low qps.

Faster deblocking
Early termination for bS=0, alpha=0, beta=0
Refactoring, various other optimizations
About 30% faster deblocking overall.
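
Why such an early exit is safe can be seen from the H.264 per-sample filter condition: a sample pair is filtered only if bS is nonzero and |p0-q0| < alpha, |p1-p0| < beta and |q1-q0| < beta. The Python sketch below restates that condition and the resulting edge-level check; the function names are hypothetical, and it reads the commit as exiting when any one of the three values is zero.

```python
def edge_needs_filtering(bs, alpha, beta):
    """Edge-level early exit: an absolute difference is never below 0,
    so a zero alpha or beta (or bS == 0) means no sample can be filtered."""
    return bs > 0 and alpha > 0 and beta > 0


def sample_is_filtered(p1, p0, q0, q1, bs, alpha, beta):
    """Per-sample filter condition across an edge between p0 and q0."""
    return (bs > 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


if __name__ == "__main__":
    print(edge_needs_filtering(2, 0, 5))              # zero alpha: skip the edge
    print(sample_is_filtered(10, 11, 13, 12, 2, 10, 5))
```

Since alpha and beta come from QP-indexed tables whose entries are zero at low indices, the check costs one comparison per edge and skips all per-sample work in those cases.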

asm cosmetics

yet another posix-emulating define on solaris

update msvc projectfile

drop support for msvc6

Prevent VBV from lowering quantizer too much
This code seemed to act up unexpectedly sometimes, creating a situation where in 1-pass VBV mode, a frame's quantizer would drop all the way to qpmin and then shoot back upwards to qpmax, causing serious visual issues.
This change may decrease bitrate in VBV mode, but that is preferable to the artifacting produced by this code.

Improve subme7 at low QPs and add subme7 support in lossless mode

cosmetics: merge x86inc*.asm

Add missing x264util.asm

Basic sanity checking of qpmax/qpmin options

Fix regression in r922
set the chroma DC coefficients to zero for residual coding in qpel-rd
fix C99ism

Refactor asm macros part 2: DCT

Refactor asm macros part 1: DCT

Improve intra RD refine, speed up residual_write_cabac
a do/while loop can be used for residual_write, but i8x8 had to be fixed so that it wouldn't call residual_write with zero coeffs
proper nnz handling added to cabac intra rd refine
chroma cbp added to 8x8 chroma rd
cbp was tested, but wasn't useful

Fix a few more minor memleaks

stats summary: print distribution of numbers of consecutive B-frames

add interlacing to the list of stuff checked by x264_validate_levels

Fix C99-ism in r907

Faster temporal predictor calculation
a separate commit because this changes rounding, and thus changes output slightly.
Date: Thu Jul 17 07:55:24 2008 -0600
Align lowres planes for improved cacheline split performance

autodetect level based on resolution/bitrate/refs/etc, rather than defaulting to L5.1
if vbv is not enabled (and especially in crf/cqp), we have to guess max bitrate, so we might underestimate the required level.
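
A rough Python sketch of this kind of autodetection follows. It picks the lowest level whose limits cover the frame size, macroblock rate, reference count and bitrate. It is not x264's implementation: the function name is hypothetical, only a subset of levels is listed, and the limit values are quoted from memory of the H.264 Annex A level table (Baseline/Main bitrates), so they should be checked against the specification before being relied on.

```python
# (level, MaxMBPS [MB/s], MaxFS [MBs], MaxDpbMbs [MBs], MaxBR [kbit/s])
# Subset of the H.264 level limits, quoted from memory; verify before use.
LEVELS = [
    ("3",   40500,  1620,  8100,   10000),
    ("3.1", 108000, 3600,  18000,  14000),
    ("3.2", 216000, 5120,  20480,  20000),
    ("4",   245760, 8192,  32768,  20000),
    ("4.1", 245760, 8192,  32768,  50000),
    ("4.2", 522240, 8704,  34816,  50000),
    ("5",   589824, 22080, 110400, 135000),
    ("5.1", 983040, 36864, 184320, 240000),
]


def guess_level(width, height, fps, refs, max_kbps):
    """Return the lowest listed level that fits, or None if none does."""
    frame_mbs = ((width + 15) // 16) * ((height + 15) // 16)
    for name, max_mbps, max_fs, max_dpb_mbs, max_br in LEVELS:
        if (frame_mbs <= max_fs
                and frame_mbs * fps <= max_mbps
                and frame_mbs * refs <= max_dpb_mbs
                and max_kbps <= max_br):
            return name
    return None


if __name__ == "__main__":
    print(guess_level(1280, 720, 30, 5, 10000))
    print(guess_level(1920, 1080, 30, 4, 20000))
```

Without VBV there is no known `max_kbps`, so a caller would have to pass an estimate, which is where the underestimation mentioned above can come from.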

fix bs_write_ue_big for values >= 0x10000.
(no immediate effect, since nothing writes such values yet)
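
For reference, an unsigned Exp-Golomb ue(v) codeword is the binary form of v+1 preceded by one zero bit for every bit after the first. The Python sketch below builds the codeword as a string; the function name is hypothetical and this is not x264's bit writer. It shows that a value of 0x10000 needs 33 bits, so the codeword no longer fits in a single 32-bit write, which is presumably why large values take a separate path.

```python
def ue_bits(value):
    """Return the Exp-Golomb ue(v) codeword for a non-negative integer
    as a string of '0' and '1' characters."""
    if value < 0:
        raise ValueError("ue(v) is defined for non-negative integers only")
    code = value + 1
    length = code.bit_length()
    # (length - 1) leading zeros, then the binary digits of value + 1.
    return "0" * (length - 1) + format(code, "b")


if __name__ == "__main__":
    for v in (0, 1, 2, 3, 0x10000):
        print(v, len(ue_bits(v)))
```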

Fix lossless mode borked in r901

Relax QPfile restrictions
Allow a QPfile to contain fewer frames than the total number of frames in the video and have ratecontrol fill in the rest.
Patch by kemuri9.
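
The relaxed behaviour can be pictured as a lookup with a fallback. The Python sketch below assumes each QPfile line has the form "framenumber frametype QP"; the function names and the "auto" marker are hypothetical, and the fallback QP stands in for whatever ratecontrol would decide.

```python
def parse_qpfile(text):
    """Parse lines of the assumed form 'framenumber frametype QP'
    into a dict mapping frame number -> (frame type, QP)."""
    forced = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines in this sketch
        forced[int(fields[0])] = (fields[1], int(fields[2]))
    return forced


def frame_decision(forced, frame, ratecontrol_qp):
    """Use the forced entry when present; otherwise defer to ratecontrol."""
    if frame in forced:
        return forced[frame]
    return ("auto", ratecontrol_qp)


if __name__ == "__main__":
    forced = parse_qpfile("0 I 20\n1 P 24\n")
    print(frame_decision(forced, 1, 30))
    print(frame_decision(forced, 5, 30))
```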

Limit MVrange correctly in interlaced mode
Bug report by Sigma Designs, Inc.

Fix bug with PCM and adaptive quantization
In rare cases CABAC desync could occur, causing bitstream corruption

Fix memory leak upon x264 closing
Doesn't affect the CLI, but potentially important for programs which call x264 as a shared library.

Fix compilation on PPC systems (borked in r903)
Bigendian systems didn't have endian_fix32 defined

Add L1 reflist and B macroblock types to x264 info
Also remove display of "PCM" if PCM mode is never used in the encode.
L1 reflist information will only show if pyramid coding is used.

Fix and enable I_PCM macroblock support
In RD mode, always consider PCM as a macroblock mode possibility
Fix bitstream writing for PCM blocks in CAVLC and CABAC, and a few other minor changes to make PCM work.
PCM macroblocks improve compression at very low QPs (1-5) and in lossless mode.

de-duplicate vlc tables

faster ue/se/te write

faster bs_write

cosmetics in ssd asm

Various optimizations and cosmetics
Update AUTHORS file with Gabriel and me
update XCHG macro to work correctly in if statements
Add new lookup tables for block_idx and fdec/fenc addresses
Slightly faster array_non_zero_count_mmx (patch by holger)
Eliminate branch in analyse_intra
Unroll loops in and clean up chroma encode
Convert some for loops to do/while loops for speed improvement
Do explicit write-combining on --me tesa mvsad_t struct
Shrink --me esa zero[] array
Speed up bime by reducing size of visited[][][] array

Resolve floating point exception with frame_init_lowres mmx
In some cases, the mmx version of frame_init_lowres could leave the FPU uninitialized for use in ratecontrol, resulting in floating point exceptions.
Since frame_init_lowres is such a time-consuming function, an emms was just put at the end, since it costs almost nothing compared to the total time of frame_init_lowres.

Update my email address

Update file headers throughout x264
Update "Authors" lists based on actual authorship; highest is most important
Update copyright notices and remove old CVS tags from file headers
Add file headers to GTK and other sections missing them
Update FSF address
Other header-related cosmetics

denoise_dct asm

cosmetics in permutation macros
SWAP can now take mmregs directly, rather than just their numbers

Fix bug in adaptive quantization
In some cases adaptive quantization did not correctly calculate the variance.
Bug reported by MasterNobody

lowres_init asm
rounding is changed for asm convenience. this makes the c version slower, but there's no way around that if all the implementations are to have the same results.

Optimizations and cosmetics in macroblock.c
If an i4x4 dct block has no coefficients, don't bother with dequant/zigzag/idct. Not useful for larger sizes because the odds of an empty block are much lower.
Cosmetics in i16x16 to be more consistent with other similar functions.
Add an SSD threshold for chroma in probe_skip to improve speed and minimize time spent on chroma skip analysis.
Rename lambda arrays to lambda_tab for consistency.

some asm functions require aligned stack. disable these when compiling with msvc/icc.

Move bitstream end check to macroblock level
Additionally, instead of silently truncating the frame upon reaching the end of the buffer, reallocate a larger buffer instead.

Convert NNZ to raster order and other optimizations
Converting NNZ to raster order simplifies a lot of the load/store code and allows more use of write-combining.
More use of write-combining throughout load/save code in common/macroblock.c
GCC has aliasing issues in the case of stores to 8-bit heap-allocated arrays; dereferencing the pointer once avoids this problem and significantly increases performance.
More manual loop unrolling and such.
Move all packXtoY functions to macroblock.h so any function can use them.
Add pack8to32.
Minor optimizations to encoder/macroblock.c


checkasm --bench=function_name

interleave psnr/ssim computation with reference frame filtering, to improve cache coherency

Add more inline asm and a runtime check for MMXEXT support
x264 will now terminate gracefully rather than SIGILL when run on a machine with no MMXEXT support.
A configure option is now available to build x264 without assembly support for support on such old CPUs as the Pentium 2, K6, etc.

Use aligned memcpy for x264_me_t struct and cosmetics

Cosmetics and loop unrolling
GCC is not very good at loop unrolling in cases where it can perform constant propagation, so the unrolling unfortunately has to be done manually.

Fix regression in 64-bit in r882
i_mvc needs to be 64-bit when used with a 64-bit memory pointer

More tweaks to me.c
Added inline MMX version of UMH's predictor difference test
Various cosmetics throughout me.c
Removed a C99-ism introduced in r878.

Fix regression in r736
r736 added intra RD refinement to B-frames; however, it is possible for subme=7 to be used without b-rdo.
This means intra RD isn't run, and therefore it is possible for intra chroma analysis to not have been run, since update_cache was never called for an intra block, and chroma ME is not required even at subme=7.
r801, which removed a memset, made this worse because previously the chroma prediction mode was at least initialized to zero; now it was not initialized at all.
Therefore, --no-chroma-me, --subme 7, and no --b-rdo had the potential to crash.
This change restricts intra RD refinement to only be run when --b-rdo is enabled (sensible to begin with), thus preventing a crash in this case.

Fix regression in r850
Bug resulted in rare incorrect chroma encoding

Cosmetics in VBV handling

Tweaks and cosmetics in me.c
Use write-combining for predictor checking and other tweaks.

Partially inline trellis quantization
Inlining trellis into the 4x4/8x8 trellis wrappers increases trellis speed by about 5-10% through constant propagation.

Various cosmetic changes.


many changes to which asm functions are enabled on which cpus.
with Phenom, 3dnow is no longer equivalent to "sse2 is slow", so make a new flag for that.
some sse2 functions are useful only on Core2 and Phenom, so make a "sse2 is fast" flag for that.
some ssse3 instructions didn't become useful until Penryn, so yet another flag.
disable sse2 completely on Pentium M and Core1, because it's uniformly slower than mmx.
enable some sse2 functions on Athlon64 that always were faster and we just didn't notice.
remove mc_luma_sse3, because the only cpu that has lddqu (namely Pentium 4D) doesn't have "sse2 is fast".
don't print mmx1, sse1, nor 3dnow in the detected cpuflags, since we don't really have any such functions. likewise don't print sse3 unless it's used (Pentium 4D).

enable ssse3 phadd satd on Penryn.

benchmark most of the asm functions (checkasm --bench).

Cosmetic: fix C99-ism

Use a gaussian window for cplxblur
Cplxblur was originally intended to use a gaussian window, but in its current form did not. This change provides a tiny improvement to 2pass ratecontrol.
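
As a generic illustration of a gaussian window, the Python sketch below blurs a sequence of per-frame complexity values with weights exp(-d^2 / (2*sigma^2)), normalised by the weights that fall inside the sequence. The function name, `sigma` and `radius` are hypothetical parameters chosen for the example; x264's ratecontrol uses its own weights and stopping rules.

```python
import math


def gaussian_blur(values, sigma, radius):
    """Blur a list of numbers with a truncated, normalised gaussian window."""
    blurred = []
    for i in range(len(values)):
        weight_sum = 0.0
        acc = 0.0
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        for j in range(lo, hi):
            weight = math.exp(-((j - i) ** 2) / (2.0 * sigma * sigma))
            weight_sum += weight
            acc += weight * values[j]
        blurred.append(acc / weight_sum)
    return blurred


if __name__ == "__main__":
    print(gaussian_blur([0, 0, 0, 10, 0, 0, 0], sigma=1.0, radius=3))
```

Compared with a flat window, nearby frames contribute more than distant ones, so a single complex frame is spread out smoothly rather than evenly.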


nasm compatible NX stack

CQP is incompatible with AQ


binmode stdin on mingw, not just msvc

omit redundant mc after non-rdo dct size decision, and in b-direct rdo

allow fractional CRF values with AQ.

fix some uninitialized partitions in rdo

2-pass VBV support and improved VBV handling
Dramatically improves 1-pass VBV ratecontrol (especially CBR) and provides support for VBV in 2-pass mode. This consists of a series of functions that attempts to find overflows and underflows in the VBV from the first-pass statsfile and fix them before encoding.
1-pass VBV code partially by Dark Shikari.

Fix noise reduction in threaded mode.
Previously enabling noise reduction with threads had no effect.
Note that this is not an optimal solution; each thread still tracks noise reduction separately (unlike in single-threaded mode).

fix a crash on win32 with threads.
r852 introduced an assumption in deblock that the stack is aligned.

remove nasm version check. a feature check is all that's needed.
silence stderr in yasm version check.

cosmetics in cabac

faster residual_write_cabac

change DEBUG_DUMP_FRAME to run-time --dump-yuv

this is the first non-runtime-detected use of mmxext, but it has to be inlined

factor duplicated code out of deblock chroma mmx


write aspect ratio in mp4

omit delta_quant in i16x16 blocks with no residual
(all other block types were already covered, but i16x16 cbp is special)

explicit write combining, because gcc fails at optimizing consecutive memory accesses

force unroll macroblock_load_pic_pointers
and a few other minor optimizations


r836 borked lossless cabac nnz

use elf instead of a.out on netbsd

fix x264_realloc when not using libc realloc.

don't pretend to support win64. remove all related code.
it hasn't worked since probably some time in 2005, and won't ever be fixed unless someone steps up to maintain it.

cosmetics: replace last instances of parm# asm macros with r#


faster probe_skip

drop support for pre-SSE3 assemblers

no point in giving it a generic name when it's not generic

faster cabac_mb_cbp_luma
ported from ffmpeg

remove some redundant nnz counts
move some nnz counts from macroblock_encode to cavlc if cabac doesn't need them

compute missing nnz count in subme7 cavlc

remove a division in macroblock-level bookkeeping

omit P/B-skip mc from macroblock_encode if the pixels haven't been overwritten since probe_skip

earlier termination in SEA if mvcost exceeds residual

remove void* arithmetic from r821

Fix define of illegal function identifiers (as defined in section "7.1.3 Reserved identifiers" of C99 spec)

Fix define of illegal identifier (as defined in section "7.1.3 Reserved identifiers" of C99 spec) "__UNUSED__", and use the one defined in common/osdep.h, i.e. "UNUSED"
based on a patch by Diego Biurrun

more consistent include name (in line with other PPC includes)

fix illegal identifiers in multiple inclusion guards
patch by Diego Biurrun % diego A biurrun P de %

AQ now treats perfectly flat blocks as low energy, rather than retaining previous block's QP.
fixes occasional blocking in fades.

checkasm cabac


--asm to allow testing of different versions of asm without recompile

copy left neighbor pixels directly from previous mb instead of main plane

cacheline split workaround for mc_luma

add "SECTION_RODATA" before "SECTION .text" to setup the fakegot label used in macho binaries.
This fixes compilation with --enable-pic
Requires Yasm 0.7.0 or newer
Patch by Dave Lee % davelee P com A gmail P com %

more hpel fixes

update msvc projectfile

r810 borked hpel_filter_sse2 on unaligned buffers

threads=auto on multicore now implies thread input, just like explicit thread numbers already did

dct4 sse2

faster x86_32 dct8

macros to deal with macros that permute their arguments

mmx cachesplit sad of non-square sizes checked height instead of width

sfence after nontemporal stores

simplify hpel filter asm (move control flow to C) and add sse2, ssse3 versions

more mmx/xmm macros (mova, movu, movh)

improve handling of cavlc dct coef overflows
support large coefs in high profile, and clip to allowed range in baseline/main

fix shared libs on MacOSX
based on a patch by İsmail Dönmez

typo in r803

fix a crash on mp4 muxing with invalid params

variance-based psy adaptive quantization
new options: --aq-mode --aq-strength
AQ is enabled by default

fix naming of .dll on mingw

don't distinguish between mingw and cygwin

remove a memset

typo. don't evaluate rd pskip when p16x16 found ref>0.

r784 borked lossless dc zigzag

fix an arithmetic overflow that disabled SEA threshold after finding a mv with SAD < mvcost.

fix hpel_filter_altivec picked up by checkasm
Patch by Manuel %maaanuuu A % and Noboru Asai % noboru P asai A gmail P com %

faster residual

nasm doesn't like align(nop) in structs

reduce the size of some cabac arrays

use cabac context transition table from trellis in normal residual coding too

rearrange cabac struct to reduce code size

higher precision RD lambda
improves quality at QP<=12.

faster cabac_encode_ue_bypass

cabac asm.
mostly because gcc refuses to use cmov.
28% faster than c on core2, 11% on k8, 6% on p4.

cosmetics in cabac

inline cabac_size_decision

cosmetics in DECLARE_ALIGNED

don't distinguish between luma4x4 and luma4x4ac

faster lossless zigzag

more alignment

add tesa and lossless to fprofile

cosmetics in residual_write

remove unused bitstream reader

cosmetics in quant asm

special case dequant for flat matrix

faster dequant

simplify hpel_filter_c

use x264_mc_copy_w16_sse2 in mc.copy, it was previously only in mc_luma

new ssd_8x*_sse2
align ssd_16x*_sse2
unroll ssd_4x*_mmx

update altivec zigzags

r768 borked cavlc

cosmetics in intra predict

faster intra predict 8x8 hu/hd

reduce zigzag arrays from int to int16_t

reduce the size of some arrays

skip intra pred+dct+quant in cases where it's redundant (analyse vs encode)
large speedup with trellis=2, small speedup with trellis=0 and/or subme>=6

cosmetics in asm



continue instead of crash when the threading mv constraint is violated.
doesn't fix the underlying bug, but hopefully less annoying until we find it.

remove remaining reference to clip1.h

fix name mangling again.
apparently it's not just a convention, dll build fails if you try to export a non-prefixed name.

update msvc projectfile

missing #ifdef HAVE_SSE3

don't define offsetof since it's standard

shut up gcc warning in offsetof

increase alignment of mv arrays


checkasm check whether callee-saved regs are correctly saved
x86_32 only for now since x86_64 varargs are annoying

fix x86_32 ads which failed to preserve a register

fix some name mangling issues introduced by the merge

remove x264_mc_clip1.
it's wrong for sufficiently perverse inputs, and clip_uint8 is faster anyway.

merge x86_32 and x86_64 asm, with macros to abstract calling convention and register names

git compatible version script

check for broken versions of yasm

increase the alignment of the i8x8 edge cache, needed for sse2 intra prediction.
patch by Alexander Strange.


pic macros now keep track of which register holds the GOT, so variable access doesn't have to care

remove x86_64 predict_8x8_ddl_mmxext because sse2 is faster even on amd

cosmetics in dsp init

sse2 16x16 intra pred.
port the remaining intra pred functions from x86_64 to x86_32.
patch by Dark Shikari.

some simplifications to mmx intra pred that should have been done way back when we switched to constant fdec_stride.
and remove pic spills in functions that have a free caller-saved reg.
patch partly by Dark Shikari.

faster array_non_zero

x86_32 sse2 idct8
ported from ffmpeg by Dark Shikari

checkasm: relax the threshold for floating-point ssim

checkasm: test idct with the range of coefficients that can really be encountered, as opposed to random numbers which might overflow.

intra_rd_refine in B-frames

print average of macroblock QPs instead of frame's nominal QP

update date

remove colorspace conversion support, because it has no business in any codec

misc fixes in checkasm

remove a useless bit of me=umh (originally copied from JM, where it was used for something)

fix a memleak in cqm

fix a memleak in mkv muxer
patch by saintdev

satd exhaustive motion search (--me tesa)

fix cabac context for nonzero delta_qp of the 2nd mb of a frame in interlaced mode

fix mapping of mvs to partitions in p4x4_chroma
patch by Noboru Asai

fix mvp for b16x8 and b8x16 L1 search
patch by Wei-Yin Chen

shave a couple cycles off cabac functions

faster and smaller x264_macroblock_cache_mv etc

configure test for endianness

change the meaning of --ref: it now selects DPB size (including B-frames), rather than L0 size (which B-frames are added to)

add / fix support for FreeBSD, based on a patch by Igor Mozolevsky % igor A hybrid-lab P co P uk %

shut up some valgrind warnings

slightly wrong memory allocation in r717, fixes a potential crash with merange>32

convert absolute difference of sums from mmx to sse2
convert mv bits cost and ads threshold from C to sse2
convert bytemask-to-list from C to scalar asm
1.6x faster me=esa (x86_64) or 1.3x faster (x86_32). (times consider only motion estimation. overall encode speedup may vary.)
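
The role of the absolute difference of sums (ADS) in an exhaustive search can be sketched as follows. By the triangle inequality, |sum(cur) - sum(ref)| never exceeds SAD(cur, ref), so a candidate whose ADS plus motion vector cost already reaches the best cost so far cannot win and its SAD need not be computed. This Python sketch uses one sum per block and flat pixel lists; the function names are hypothetical and the real code works on sub-block sums in SIMD.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ads(a, b):
    """Absolute difference of sums: a cheap lower bound on sad(a, b)."""
    return abs(sum(a) - sum(b))


def sea_search(cur, candidates, mv_costs):
    """Return (best index, best cost, number of SADs actually computed)."""
    best_index, best_cost, sads_computed = None, None, 0
    for i, (ref, mv_cost) in enumerate(zip(candidates, mv_costs)):
        # Skip candidates whose lower bound cannot beat the current best.
        if best_cost is not None and ads(cur, ref) + mv_cost >= best_cost:
            continue
        cost = sad(cur, ref) + mv_cost
        sads_computed += 1
        if best_cost is None or cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost, sads_computed


if __name__ == "__main__":
    cur = [10, 20, 30, 40]
    refs = [[0, 0, 0, 0], [10, 20, 30, 41], [40, 30, 20, 10], [100] * 4]
    print(sea_search(cur, refs, [0, 2, 1, 3]))
```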

round esa range to a multiple of 4

use define _WIN32 instead of __WIN32__ or WIN32 defines.
MSDN reference:
Patch by BugMaster %BugMaster A narod P ru%
Original thread:
date: Dec 27, 2007 3:18 AM
subject: [x264-devel] VS2008 compilation error (need of replacement __WIN32__ with _WIN32)

tweak x264_pixel_sad_x4_16x16_sse2 horizontal sum. 168 -> 166 cycles on core2.

fix a nondeterminism involving 8x8dct, rdo, and threads.

also test arch-specific x264_zigzag_* implementations in checkasm.c
Patch by Noboru Asai % noboru P asai A gmail P com%

Add AltiVec implementation of
- x264_zigzag_scan_4x4_frame_altivec()
- x264_zigzag_scan_4x4ac_frame_altivec()
- x264_zigzag_scan_4x4_field_altivec()
- x264_zigzag_scan_4x4ac_field_altivec()
each around 1.3 to 1.8x faster than C version
Patch by Noboru Asai % noboru P asai A gmail P com%

adds AltiVec implementation of predict_16x16_p()
over 4x faster than C version

revert the x86_32 part of r708. elf shared libraries aren't important enough to be worth the extra lines of code to check for nasm.

mark asm functions as hidden

check whether ld supports -Bsymbolic before using it

reduce the data type used in some tables. 16KB smaller exe.

faster removal of duplicate mv predictors

avoid a division in x264_mb_predict_mv_ref16x16.
patch by Dark Shikari.

avoid a division in umh.
patch by Dark Shikari.

fix a memleak in h->mb.mvr

fix compilation as a shared library on x86_64 (regression in r696)

add support for x86_64 on Darwin9.0 (Mac OS X 10.5, aka Leopard)
Patch by Antoine Gerschenfeld %gerschen A clipper P ens P fr%

cover some more options in fprofile. (esa, bime, cqm, nr, no-dct-decimate, trellis2)
previously, esa was slower with fprofile than without, since gcc thought it wasn't important. now esa benefits like anything else.

Add AltiVec implementation of x264_pixel_ssd_8x8, 3x faster than C version
Overall speed-up: 0.7% with --bframes 3 --ref 5 -m 7 --b-rdo
Patch by Noboru Asai %noboru P asai A gmail P com%

limit mvs to [-512,511.75] instead of [-512,512]
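
As an illustration of the range above: H.264 motion vectors are stored in quarter-pixel units, so [-512, 511.75] pixels corresponds to [-2048, 2047], which fits a signed 12-bit value. A minimal sketch of such a clamp follows; the function name and defaults are hypothetical and not taken from x264's source.

```python
QPEL = 4  # quarter-pel units per full pixel (H.264 luma motion vectors)

def clamp_mv_component(mv_qpel: int,
                       lo_pixels: float = -512.0,
                       hi_pixels: float = 511.75) -> int:
    """Clamp one motion-vector component, given in quarter-pel units.

    Hypothetical helper: the limits mirror the changelog entry, not x264 code.
    """
    lo = int(lo_pixels * QPEL)  # -2048
    hi = int(hi_pixels * QPEL)  #  2047
    return max(lo, min(hi, mv_qpel))
```

With the old upper limit of 512 pixels the maximum would have been 2048, one step outside the signed 12-bit range.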

avoid memory loads that span the border between two cachelines.
on core2 this makes x264_pixel_sad an average of 2x faster. other intel cpus gain various amounts. amd are unaffected.
overall speedup: 1-10%, depending on how much time is spent in fullpel motion estimation.
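
To make the condition concrete, here is a small sketch of how one might test whether a load crosses a cacheline boundary. The 64-byte line size is an assumption (it matches Core 2), and `spans_cacheline` is a hypothetical name, not an x264 function.

```python
def spans_cacheline(addr: int, width: int, cacheline: int = 64) -> bool:
    """Return True if a load of `width` bytes starting at `addr`
    touches two cachelines (assumed line size: 64 bytes)."""
    offset = addr % cacheline
    return offset + width > cacheline
```

A 16-byte load at offset 48 ends exactly on the boundary and stays within one line, while the same load at offset 56 straddles two lines and would be a candidate for the slower path.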

add cache info to cpu_detect. also print sse3.

cosmetics: reorder mc_luma/mc_chroma/get_ref arguments for consistency with other functions

separate pixel_avg into cases for mc and for bipred

add AltiVec implementation of ssim_4x4x2_core, about 4x faster than C version.
Overall: 0.1-0.2% faster with default encoding settings
Patch by Noboru Asai %noboru P asai A gmail P com%

Add AltiVec implementation of x264_hpel_filter. Provides a 10-11% overall speed-up with default encoding options
Patch by Noboru Asai %noboru P asai A gmail P com%

cosmetics in dsp function selection

remove sad_pde. it's been unused ever since successive elimination replaced it.

cosmetics: use symbolic constants for frame padding radius

move hpel_filter cpu detection to a function pointer like everything else

cosmetics: use separate variables for frame width and stride

Add AltiVec implementation of add4x4_idct, add8x8_idct, add16x16_idct, 3.2x faster on average
1.05x faster overall with default encoding options
Patch by Noboru Asai % noboru DD asai AA gmail DD com %

add AltiVec implementation of dequant_4x4 and dequant_8x8, 2.8x faster than C,
1.01x faster than previous revision with default encoding options
Patch by Noboru Asai % noboru DD asai AA gmail DD com %

Add AltiVec implementation of quant_2x2_dc,
fix Altivec implementation of quant_(4x4|8x8)(|_dc) wrt current C implementation
Patch by Noboru Asai % noboru DD asai AA gmail DD com %

fix a possible nondeterminism with me=umh + threads.

use hex instead of dia for rdo mv refinement. ~0.5% lower bitrate at subme=7.
patch by Dark Shikari.

port sad_*_x3_sse2 to x86_64

don't overwrite pthread* namespace, because system headers might define those functions even if we don't want them

faster 4x4 sad

fix an arithmetic overflow in trellis at high qp.

implement multithreaded me=esa

fix some integer overflows. now vbv size can exceed 2 Gbit.

allow --vbv-init to take absolute values (in kbit), in addition to the previous fractions of vbv-bufsize.
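
One plausible way to accept both forms is sketched below. It assumes that values up to 1 are fractions of the buffer and anything larger is an absolute size in kbit; that threshold and the function name are assumptions for illustration, not confirmed by the entry above.

```python
def initial_vbv_fill_kbit(vbv_init: float, vbv_bufsize_kbit: float) -> float:
    """Interpret a --vbv-init style value and return the initial fill in kbit.

    Assumption: vbv_init <= 1 is a fraction of the buffer,
    vbv_init > 1 is an absolute amount in kbit.
    """
    if vbv_init > 1.0:
        fraction = vbv_init / vbv_bufsize_kbit
    else:
        fraction = vbv_init
    fraction = max(0.0, min(1.0, fraction))  # never exceed the buffer
    return fraction * vbv_bufsize_kbit
```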

remove a bashism

reorder headers so that largefile support is defined before the first copy of stdio

regression in r669: broke saving of configure args if make has to re-run configure

regression in r669: --enable-shared should imply --enable-pic on some archs.

* Add a --host flag to allow overriding config.guess; this is particularly
useful with a 64-bits kernel running a 32-bits userland to build 32-bits
* Normalize any host triplet into a quadruplet via config.sub.
* Move option parsing before any use of architecture information.

* Update config.guess.

mingw doesn't have strtok_r

move os/compiler specific defines to their own header

extend zones to support (some) encoding parameters in addition to ratecontrol.


limit vertical motion vectors to +/-512, since some decoders actually depend on that limit.

Add vertical and horizontal luma deblocking accelerated with Altivec,
based on Graham Booker's code written for FFmpeg with slight modifications
to re-use x264's macros

cosmetics in cpu detection

fix compilation without asm on x86_32 (r658 worked only on x86_64).

exempt 1080p from the non-mod16 warning.

require a ratecontrol method to be specified, it no longer defaults to cqp=26.

fix nnz computation in cavlc+8x8dct+deblock. (regression in r607)

fix the computation of bits used for vbv. (regression in r651)

c89 compile fix

cabac: use bytestream instead of bitstream.
35% faster cabac, 20% faster overall lossless, ~1% faster overall at normal bitrates.

remove the restriction on number of threads as a function of resolution (it was wrong anyway in the presence of B-frames), and raise the max number of threads in general (though more will have to be done before it can really scale to lots of cores).

tweak ssse3 quant

change some tables from int to int8_t. 13KB smaller executable.

faster cabac rdo. up to 10% faster at q0, but negligible at normal bitrates.

workaround gcc's inability to align variables on the stack.
this crash was introduced in r642, but only because previous versions didn't use sse2 on the stack.

32bit version of ssse3 satd.
switch default assembler to yasm. it will still fallback to nasm if you don't have yasm.

simplify trellis

fix an arithmetic overflow in trellis with QP >= 42

2x faster quant. 2% overall.
side effects:
not bit-identical to the previous algorithm.
while the new algorithm covers a wider range of cqms than the previous one did,
I couldn't find a good way to fallback to a general version for the extreme
cqms. so now it refuses to encode extreme cqms instead of just being slower.
lays a framework for custom deadzone matrices, though I didn't add an api.

when encoding with a cqm, probe_skip now also uses the cqm, instead of the flat matrix

cosmetics in asm macros

in hpel search, merge two 16x16 mc calls into one 16x17. 15% faster hpel, .3% overall.

Compile fix

remove private stuff from public headers. no more need for -D__X264__

adjust bitstream buffer sizes for very large frames

conflate HAVE_MMXEXT with HAVE_SSE2, since they were never used distinctly.

* Made -DNEED_ALTIVEC unnecessary, thanks to Guillaume Poirier.

* check x264_cpu_detect() before calling AltiVec functions.

ssse3 detection. x86_64 ssse3 satd and quant.
requires yasm >= 0.6.0

* Use -maltivec when building dependencies, or <altivec.h> cannot be used.
* Do not declare vectors in non-AltiVec files.

* common/cpu.c: runtime AltiVec autodetection on Linux.
* configure, Makefile: do not build the whole project with -maltivec because
it generates AltiVec code in weird places.

fix a small memleak.
patch by Limin Wang.

compile fix for GCC-3.3 on OSX, based on a patch by
Patrice Bensoussan % patrice P bensoussan A free P fr%
Note: regression tests still do not pass with GCC-3.3,
but they never did as far as I can remember.

cosmetics in regression test

oops, scenecut detection failed to activate when using threads and not using B-frames

extras/getopt.c was BSD licensed. replace with a LGPL version (from glibc).

Fix build issues on Linux. Only gcc-4.x is supported, as on OSX.
Cleans up a few inconsistencies in the code too.

tweak block_residual_write_cavlc.
up to 1% faster lossless, no difference at normal bitrates.

don't assume int is exactly 4 bytes

make array_non_zero() compatible with -fstrict-aliasing

Honor CFLAGS and LDFLAGS set by the user

Check whether 'echo -n' works, otherwise try printf (fixes build on current OS X 10.5)

Check version of nasm on OS X / Intel

wrong reference frames were used with refs>=14 + pyramid (regression in r607)

enable thread synchronization primitives on linux too

fix a crash with x264_encoder_headers() + threads

don't skip autodetection on configure --enable-pthread

more win32threads -> pthreads

cosmetics: rename list operators to be consistent with Perl, and move them to common/

win32: use pthreads instead of win32threads. for some reason, pthreads is much faster.

New threading method:
Encode multiple frames in parallel instead of dividing each frame into slices.
Improves speed, and reduces the bitrate penalty of threading.
Side effects:
It is no longer possible to re-encode a frame, so threaded scenecut detection
must run in the pre-me pass, which is faster but less precise.
It is now useful to use more threads than you have cpus. --threads=auto has
been updated to use cpus*1.5.
Minor changes to ratecontrol.
New options: --pre-scenecut, --mvrange-thread, --non-deterministic
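
The cpus*1.5 rule for --threads=auto can be sketched as follows. Rounding down and the minimum of one thread are assumptions; the entry only states the factor.

```python
def auto_thread_count(cpus: int) -> int:
    """Thread count for a --threads=auto style setting: cpus * 1.5.

    Assumptions: integer truncation, and at least one thread.
    """
    return max(1, cpus * 3 // 2)
```

Using more threads than CPUs makes sense with frame-level threading because a thread may have to wait for the frame it references to be far enough along.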

* Do not assume anything about sizeof(cpu_set_t).

* Add support for kFreeBSD (FreeBSD kernel with GNU userland).

Add Altivec implementations of add8x8_idct8, add16x16_idct8, sa8d_8x8 and sa8d_16x16
Note: doesn't take advantage of some possible aligned memory accesses, so there's still room for improvement

Force alignment of the fake .rodata on MacIntel

don't treat vbv_maxrate as a minrate too if it's higher than target average bitrate.

Merges Guillaume Poirier's AltiVec changes:
* Adds optimized quant and sub*dct8 routines
* Faster sub*dct routines
~8% overall speed-up with default settings

10% faster deblock mmx functions. ported from ffmpeg.

checkasm: ignore insignificant differences in floating-point ssim

display final ratefactor in abr when a loose vbv is applied. (still disabled in true cbr)

fix parsing of --deblock %d,%d (beta was ignored)

compute chroma_qp only once per mb

rd refinement of intra chroma direction (enabled in --subme 7)
patch by Alex Wright.

fix a crash in avc2avi

skip deblocking and motion interpolation when using only I-frames


allow fractional values of crf

prefetch pixels for motion compensation and deblocking.

fix a crash on interlace + >8 reference frames

no more decoder. it never worked anyway, and the presence of defunct code was confusing people.

compute pskip_mv only once per macroblock, and store it

slightly faster chroma_mc_mmx

missing emms in plane_copy_mmx

merge center_filter_mmx with horizontal_filter_mmx

1.5x faster center_filter_mmx (amd64)

mmx/prefetch implementation of plane_copy

no more vfw

gtk fixes:
in Makefile
- fix datadir for mingw users
- remove the shared lib during the clean rule
- use $(ENCODE_BIN) instead of x264_gtk_encode
- add some $(DESTDIR) and create some directories when necessary
- remove -lintl
statfile_length -> statsfile_length
fix the "sensitivity" of the widget of update_statfile
the logo is now handled correctly on windows
added: beginning of multipass support
patch by Vincent Torri.

accept mencoder's option names as synonyms (api only, not in x264cli)

simplify satd_sse2

better error checking in x264_param_parse.
add synonyms for a few options.

fix some strides that weren't a multiple of 16.
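
For reference, rounding a width up to a multiple of 16 is usually done with a mask. This is a generic sketch of that idiom, not the code from the commit; `align_up` is a hypothetical helper.

```python
def align_up(value: int, alignment: int = 16) -> int:
    """Round `value` up to the next multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (value + alignment - 1) & ~(alignment - 1)
```

A stride that is a multiple of 16 keeps every row start on a 16-byte boundary, which aligned SSE2 loads and stores require.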

tweak motion compensation amd64 asm. 0.3% overall speedup.

strip local symbols from asm .o files, since they confuse oprofile

add an option to control direct_8x8_inference_flag, default to enabled.
slightly faster encoding and decoding of p4x4 + B-frames,
and is needed for strict Levels compliance.

allow custom deadzones for non-trellis quantization.
patch by Alex Wright.

move zigzag scan functions to dsp function pointers.
mmx implementation of interlaced zigzag.

support interlace. uses MBAFF syntax, but is not adaptive yet.

allow --zones in cqp encodes

cli: fix some typos in vui parameters from r542.
patch by Foxy Shadis.

* Add an "all" rule to the Makefile. Ideally "default" should be renamed,
but I don't want to break existing scripts.

workaround: on some systems, alloca() isn't aligned

missing picpop

fix a buffer overread from r540

cosmetics (spelling)

faster ESA

faster ESA

* Use the autotool's config.guess script instead of uname to check the
system and CPU types, to avoid issues when using for instance a 32-bit
userland on top of a 64-bit kernel.

* Add the autotool's config.guess script so that we can use it instead
of uname in the configure script.

10l in r553

ssim broke on amd64 w/ pic.

support changing some more parameters in x264_encoder_reconfig()

SSIM computation. (default on, disable by --no-ssim)

configure: --enable-debug reduces optimization to -O1


gcc -fprofile-generate isn't threadsafe

cli: move some options from --help to --longhelp

cli: don't try to get resolution from filename unless input is rawyuv

r542 broke --visualize

Nicer OS X x264_cpu_num_processors (thanks David)

Support OS X and BeOS in x264_cpu_num_processors

Fixes contexts allocation with threads=auto

select initial qp for abr and cbr based on satd and bitrate, rather than cq24.

--threads=auto to detect number of cpus

api addition: x264_param_parse() to set options by name

fix a rare NaN in ratecontrol

move quant_mf[] from x264_t to the heap, and merge duplicate entries

GTK update. patch by Vincent Torri.
cleaning of Makefile
time elapsed seems broken ('total time' label replaced by 'time remaining')
text entries of the status window are now not editable
compilation from x264/ (add --enable-gtk option to configure)
shared lib creation if --enable-shared is passed to configure
--b-rdo, --no-dct-decimate

new option: --qpfile forces frames types and QPs.
(intended for ratecontrol experiments, not for real encodes)

api change: select ratecontrol method with an enum (param.rc.i_rc_method) instead of a bunch of booleans.

slightly faster mmx dct

OpenBSD build fixes.
patch by Vizeli Pascal (pvizeli at yahoo dot de)

mc_chroma width2 mmx

make symlink relative

GTK update. patch by Vincent Torri.
tooltips (without descriptions yet)
`make clean` for .exe
when file exists, ask for override
debug level bug
bitrate slider bug
mixed-refs can be set only if ref>1
i8x8 can be set only if 8x8 transform is enabled
# of threads capped at 4
fourcc can't be removed

vfw installer: tweak nsis compression.
patch by Francesco Corriga.

Fixed typo that caused x264_encoder_open to always fail

check some mallocs' return value

make -> $(MAKE)

convert non-fatal errors to message level "warning".

fix a memory alignment. (no effect on x86, but might be needed for other simd)

when using DEBUG_DUMP_FRAME, write decoded pictures in display order.
patch by Loic Le Loarer.

non-referenced B-frames should have the same frame_num as the following ref frame, not the previous.
patch by Loic Le Loarer.

set the SPS constraint_set[01]_flag based on the profile in use, just in case some decoder cares

msvc doesn't like C99 named array initializers

allow sar=1/1.
patch by Loic Le Loarer.

faster intra search: filter i8x8 edges only once, and reuse for multiple predictions.

faster intra search: some prediction modes don't have to compute a full hadamard transform.
x86 and amd64 asm.

--sps-id, to allow concatenating streams with different settings.

typo in expand_border_mod16

typo impaired 2pass bitrate prediction.

Let the user choose the compiler with "CC=xxx ./configure"

More vector types fixes for gcc 3.3

More vector casts to try and make compilers happier

Use sa8d instead of satd for i8x8 search.
+.01 dB, -.5% speed

Before evaluating the RD score of any mode, check satd and abort if it's much worse than some other mode.
Also apply more early termination to intra search.
speed at -m1:+1%, -m4:+3%, -m6:+8%, -m7:+20%
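
The idea of gating the expensive RD evaluation on the cheap SATD score can be sketched like this. The mode names, the 1.25 threshold, and the `rd_cost` callback are all hypothetical; the entry does not say what "much worse" means in x264.

```python
def choose_mode(satd_costs: dict, rd_cost, threshold: float = 1.25):
    """Pick the mode with the lowest RD cost, skipping modes whose SATD
    is much worse than the best SATD.

    satd_costs: mode name -> cheap SATD estimate
    rd_cost:    callable(mode) -> expensive rate-distortion cost
    threshold:  assumed cutoff ratio, for illustration only
    """
    best_satd = min(satd_costs.values())
    best_mode, best_rd = None, float("inf")
    evaluated = []
    for mode, satd in satd_costs.items():
        if satd > best_satd * threshold:
            continue  # SATD is far off: do not spend time on RD
        evaluated.append(mode)
        rd = rd_cost(mode)
        if rd < best_rd:
            best_mode, best_rd = mode, rd
    return best_mode, evaluated
```

The trade-off is visible in the sketch: a mode with a poor SATD but a good RD score is never evaluated, which is how a speedup of this kind can cost a little quality.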

* common/ppc/pixel.c: fixed illegal implicit casts of vector types.

* Added %$#@#$! support for #@%$!#@ armv4l CPU.

When evaluating predictors to start fullpel motion search, use subpel positions instead of rounding to fullpel.
about +.02 dB, -1.6% speed at subme>=3
patch by Alex Wright.

mmx implementation of x264_pixel_sa8d

10l in r463 (q0 i16x16 dc was permuted)

typo in r504

update msvc project files.
patch by anonymous.

Before, we eliminated dct blocks containing only a small single coefficient. Now that behavior is optional, by --no-dct-decimate.
based on a patch by Alex Wright.

Enables more aggressive optimizations (-fastf -mcpu=G4) on OS X.
Adds AltiVec interleaved SAD and SSD16x16.
Overall speedup up to 20%.
Patch by anonymous

faster cabac_encode_bypass

restored AltiVec dct

more AltiVec mc, ~4.5% overall speedup

slightly faster loopfilter

3% faster satd_mmx

cosmetics in sad/ssd/satd mmx

store quoted configure options. needed e.g. for multiple args under --extra-cflags.

fix a yasm-incompatible syntax in x86 asm

yasm noexec stack

more interleaved SAD.
25% faster halfpel.

more interleaved SAD.
1% faster umh, 6% faster esa.

interleave multiple calls to SAD.
15% faster fullpel motion estimation.

* Added support for ppc64. I'm really fucking tired of having to do this.

use LDFLAGS when linking shared lib

GTK: support yuv4mpeg input.
patch by Vincent Torri.

GTK: fix avs input
patch by Vincent Torri.

cli: support yuv4mpeg input.
patch by anonymous.

GTK: compilation fixes

GTK: compilation fixes on mingw,
add avs input for the app (if available),
add filters for the filechooser,
add icon for the main window.
patch by Vincent Torri.

GTK-based graphical frontend.
patch by Vincent Torri.

silence some gcc warnings

use FDEC_STRIDE instead of a parameter in mmx dct
.5% speedup

* configure: support for 64 bits MIPS.

10l in r473 and stdin

RD subpel motion estimation (--subme 7)

cosmetics in cabac_mb_cbf

separate --thread-input from --threads

if --threads > 1, then read the input stream in its own thread.

FreeBSD uses ELF

10l in r470 on x86_64

some mmxext functions really only required mmx.

simplify get_ref and mc_luma

b16x16 wpred analysis used wrong weight

configure: --enable-shared for

wrong modulus when delta_qp = +26
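
For context, in 8-bit H.264 the per-macroblock QP delta must lie in [-26, +25], and the decoder reconstructs the QP modulo 52. A difference of +26 therefore has to be coded as -26. The sketch below shows that wrap-around; the function names are hypothetical and this is not the code from the fix.

```python
QP_RANGE = 52  # number of QP values for 8-bit H.264 (0..51)

def wrap_delta_qp(prev_qp: int, new_qp: int) -> int:
    """Return a delta in [-26, +25] that takes prev_qp to new_qp modulo 52."""
    delta = new_qp - prev_qp
    if delta > 25:
        delta -= QP_RANGE
    elif delta < -26:
        delta += QP_RANGE
    return delta

def decode_qp(prev_qp: int, delta: int) -> int:
    """Decoder-side reconstruction: QP = (prev + delta + 52) % 52."""
    return (prev_qp + delta + QP_RANGE) % QP_RANGE
```

For example, going from QP 10 to QP 36 is a difference of exactly +26, which is coded as -26 and still decodes to 36.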

10l in vbv + 2pass

macroblock-level ratecontrol: improved vbv strictness, and improved quality when using vbv.

keep transposed dct coefs. ~1% overall speedup.

tweak rounding of 8x8dct

cosmetics in makefile

cosmetics: muxers -> muxers.c

no --nr in intra blocks. intra prediction doesn't work well enough for the residual to be indicative of noise.

10l in direct auto + multiref + 1pass

--direct auto
selects direct mode per frame. works best in 2pass (enable in both passes).

change default direct mode to spatial

remove TODO. most of it is done, and the rest is out of date.

more amd64 mmx intra prediction

for i8x8 neighbors, don't assume a new slice starts at the edge of the frame

* common/i386/i386inc.asm: got PIC to work for real on OS X x86.

* common/i386/*.asm: don't use the "GLOBAL" reserved word, some versions
of NASM complain about it. Replaced it with "GOT_ebx".

* configure: activate minor nasm optimisations, such as assembling
"add eax, 8" as "add eax, byte 8".

* common/i386: factored the .rodata section declaration into i386inc.asm.

* configure common/i386/i386inc.asm: got rid of -DFORMAT_* nasm flags
and use built-in preprocessor tests instead.

* common/i386/i386inc.asm: tell the ELF linker about our stack properties
so that it does not assume the stack has to be executable.

10l in r443 (p4x4 chroma)

copy current macroblock to a smaller buffer, to improve cache coherency and reduce stride computations.
part 3: asm

copy current macroblock to a smaller buffer, to improve cache coherency and reduce stride computations.
part 2: intra prediction

copy current macroblock to a smaller buffer, to improve cache coherency and reduce stride computations.
part 1: memory arrangement.


lowres intra used wrong neighboring pixels

trellis=2 slightly affected intra analysis even without subme=6

* encoder/ratecontrol.c: OS X support for exp2f and sqrtf.

allow delta_qp > 26

ratecontrol didn't always account for header bits, causing an undersize in multipass with --ratetol inf.

-q0 --b-rdo wasn't lossless


allow ',' separator for --filter

VfW: 10l in bime and refs

more lowres mv clipping fixes

VfW: cosmetics

VfW: support trellis, brdo, nr, bime.
patch by Dan Nelson (dnelson at allantgroup dot com).

amd64 mmx for some intra pred functions

dequant_mmx made incorrect assumptions about extreme inputs. now uses 32bit in more cases.
patch by Christian Heine.

lowres can reuse the normal mv cost table

r422 broke x264_center_filter_mmxext

* configure: define FORMAT_ELF under Linux and FORMAT_AOUTB under *BSD.

* common/i386/i386inc.asm: support for ELF, a.out and Mach-O objects.

* configure: added a --enable-pic flag.

* Additional fixes to the PIC versions of assembly routines. They now pass
all checkasm tests and output streams are bit-by-bit identical, which
sounds good.

* tools/checkasm.c: print the random seed used for the test, to allow for
replays. It looks like dequant_4x4 fails 1 time out of 600, with the
following seeds for instance: 1423 1957 2149 2455 3385 3403 3724 4095.

cosmetics in mc_chroma

* Oh, so what I thought was unused code was in fact used. This fixes my
breakage but makes the code rather slow in PIC mode. I will fix it later.

* Support for x86 position-independent code (PIC), needed for dynamic libs
on Mac OS X Intel. I tried to make this as little intrusive as possible.

msvc: #define isfinite()

x86 mmx for some intra pred functions

cosmetics: reorganize intra prediction dsp

too many systems don't have off_t; use uint64_t instead.

fix order of frame evaluation in pre-me

update AUTHORS

fix a check for NaN in ratecontrol

fix mv predictors in pre-me for b-adapt.

print --nr in sei params. tweak ratecontrol param checking.

I've moved

write correct VUI timing info

early termination in UMH search

split mv_range enforcement from edge-of-frame clipping. fixes an occasional artifact with long mvs.

cosmetics: suppress warning on unused variables

cosmetics: simplify #includes

* configure: NSLU2 platform support (why oh why)

Re-enabled x86 optims on MacIntel, assume Nasm CVS is installed and
-f macho -DPREFIX just seems to do the job

Quick compile fix for OS X / Intel
Optimizations are disabled at the moment. In order to get them to
work, we'd need either nasm to be able to output Mach-O object files,
or we should convert the assembly code to something OS X can handle,
like gas.

cli: large file support

dct-domain noise reduction (ported from lavc)

early termination within large SADs. ~1% faster UMH, ~4% faster ESA.
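
A minimal sketch of early termination inside a SAD: accumulate the sum in pieces and stop as soon as it can no longer beat the best candidate found so far. Checking after every row is an assumption made for clarity; the entry does not state the granularity x264 uses.

```python
def sad_with_early_exit(cur, ref, best_so_far):
    """Sum of absolute differences between two blocks (lists of rows).

    Returns None as soon as the partial sum reaches `best_so_far`,
    since this candidate can no longer win.
    """
    total = 0
    for row_cur, row_ref in zip(cur, ref):
        total += sum(abs(a - b) for a, b in zip(row_cur, row_ref))
        if total >= best_so_far:
            return None  # give up early
    return total
```

Exhaustive search benefits more than UMH because it visits many more poor candidates, which is consistent with the larger gain quoted for ESA.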

mkv: increase nalu size size to 4 bytes.
patch by Haali.

less 64bit math: 12% faster trellis

more error checking of input parameters

always write sps.vui

use some extra packing modes for CQM headers.
fix typo in --cqm4p[yc].

MSVC compatibility fixes

joint bidirectional motion refinement (--bime)

fix some overflows in mp4 timestamps.
patch by Francesco Corriga.

Successive elimination motion search: same as exhaustive search, but 2-3x faster.
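
Successive elimination relies on the inequality |sum(cur) - sum(ref)| <= SAD(cur, ref): if the difference of the block sums is already no better than the best SAD so far, the full SAD cannot win and is skipped. The sketch below uses flat lists and hypothetical function names. A real implementation would precompute the reference sums so the bound is much cheaper than the SAD.

```python
def sad(a, b):
    """Sum of absolute differences of two equal-length blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def full_search(cur, candidates):
    """Plain exhaustive search: compute the SAD for every candidate."""
    best_idx, best = -1, float("inf")
    for i, ref in enumerate(candidates):
        cost = sad(cur, ref)
        if cost < best:
            best_idx, best = i, cost
    return best_idx, best

def sea_search(cur, candidates):
    """Exhaustive search with successive elimination.

    Returns (best index, best SAD, number of SADs actually computed).
    """
    cur_sum = sum(cur)
    best_idx, best, sads_computed = -1, float("inf"), 0
    for i, ref in enumerate(candidates):
        # Lower bound on the SAD; skip if it cannot beat the current best.
        if abs(cur_sum - sum(ref)) >= best:
            continue
        sads_computed += 1
        cost = sad(cur, ref)
        if cost < best:
            best_idx, best = i, cost
    return best_idx, best, sads_computed
```

Because the bound never exceeds the true SAD, the result is identical to the exhaustive search; only the number of full SAD computations drops.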

Fixed cc_check on OS X (gcc -o /dev/null always fails)

postpone pskip decision until after p16x16ref0 motion search.
reduces the number of erroneous pskips in low-detail regions.

configure: autodetect gpac, avis, pthread, vfw

patch by Alex Wright.

cosmetics: config.h is now modified only by configure. make now calls configure if you haven't.

MP4: set "track enabled" flag.
patch by Robert Swain.

faster subpel motion search.
patch by Alex Wright.

don't use gnu extensions to grep and sed.

pkg-config: major.minor.patch version

`make fprofiled` to automate gcc -fprofile-generate/use


param.b_repeat_headers (not yet used)

support pkg-config.
patch by Caro.

write encoding options to the userdata SEI and to the 2pass statsfile.
check for incompatible options in the 2nd pass.

change default level to "5.1"

skip dequant+idct of decimated blocks.

after a 1pass ABR, print the value of --crf which would result in the same bitrate.

subpel search: always check mvp.

faster b-rdo (skip RD of modes with bad SATD).
patch by Alex Wright.

RD mode decision for B-frames (--b-rdo)
patch by Alex Wright.

* common/amd64/quant-a.asm: added missing GLOBAL flags that prevented PIC
builds, thanks to Anssi Hannula.

* configure: added the Alpha platform.

use array_non_zero() when we don't need a full array_non_zero_count()

mmx dequant. up to 3% speedup w/ RD.

allow --level to understand names in addition to idc

check (most of) the levels constraints.
set default max_mv_range based on level_idc.

if p16x16 RD decides to code a MB as p_skip, then don't check smaller partitions.

Trellis RD quantization.
around +.2 dB

cosmetics: XCHG macro

skip a few duplicate candidates in qpel search.

skip a few duplicate candidates in fullpel hex&umh search.

cli: arithmetic overflow in bitrate printing

cosmetics in x264_cabac_mb_type

X264_ABS => abs

amd64 sse2 8x8dct. 1.45x faster than mmx.

allow 1pass ratecontrol with keyint=1

cli: print estimated time left in --progress


rm doc/dct.txt

in constant QP mode, write that QP in the PPS to save a few bits in each slice header.
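
The saving comes from how the slice QP is coded: the slice header carries slice_qp_delta as a signed Exp-Golomb value relative to the QP stored in the PPS (26 + pic_init_qp_minus26), and a delta of 0 costs a single bit. A small sketch of the bit cost, with hypothetical helper names:

```python
def se_bits(value: int) -> int:
    """Length in bits of a signed Exp-Golomb code, se(v), as used in H.264."""
    # Map the signed value to an unsigned code number: 0, 1, -1, 2, -2, ...
    code_num = 2 * value - 1 if value > 0 else -2 * value
    # An unsigned Exp-Golomb code for code_num takes 2*n - 1 bits,
    # where n is the bit length of code_num + 1.
    return 2 * (code_num + 1).bit_length() - 1

def slice_qp_delta_bits(slice_qp: int, pps_qp: int) -> int:
    """Bits spent on slice_qp_delta in each slice header."""
    return se_bits(slice_qp - pps_qp)
```

As an example under these definitions, encoding at constant QP 20 costs 7 bits per slice header if the PPS says 26, and 1 bit if the PPS says 20.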

faster decimation

cosmetics: fix an erroneous warning from r340.

cosmetics: change literal cabac_block_cat to an enum.

cabac: merge i_state with i_mps. bs_write multiple bits at once.

remove unused adaptive cabac_idc code

Fixed compilation on PPC (spotted by David Wolstencroft)

mmx deblocking.
2.5x faster deblocking functions, 1-4% overall.

If frame count is known at init time (cli & vfw), then abort if the 2nd pass
exceeds the length of the 1st pass.
If it's not known (mencoder), then report a non-fatal error when we run off the
end of the 1st pass stats, and switch to constant QP.

move checkasm to tools/
delete unused stuff in testing/
`make clean` deletes checkasm and avc2avi

checkasm: check 8x8dct, mc average, quant, and SSE2.

r336 broke amd64 x264_pixel_sad_16x16_sse2 (though it's not being used)

Windows 64bit asm.
patch by squid_80.

delete build/cygwin because it's handled in the main configure/makefile.

--crf: 1pass quality-based VBR.

Added --enable-gprof (patch by Johannes Reinhardt)

cosmetics: remove #if0'ed code
patch by Robert Swain.

faster bs_write

during RDO, skip the bitstream writing and just calculate the number of bits
that would be used. speedup: cabac +4-8%, cavlc +2-4%.

Use SAD instead of SATD for halfpel motion search.
Move multiref termination after halfpel search.
Total: 3-7% speedup and +/-.02 dB.
patch by Alex Wright.

VfW: mixed refs.
patch by celtic_druid.

allow non-mod16 resolutions

VfW: prevent duplicate free() in compress_end()

cosmetics: remove declarations of nonexistent asm functions

cosmetics (whitespace) in VfW

VfW: some reorganization
patch by Francesco Corriga.

cosmetics: merge some duplicate tables

remove cabac byte-stuffing code, because it just wastes bits in lossless, and does nothing at all at sane bitrates.

don't allocate lowres planes if they won't be used (i.e. in the 2nd pass).

cosmetics: move some stuff from macroblock_encode to cache_save

new option: --mixed-refs
Allows each 8x8 or 16x8 partition to independently select a reference frame, as opposed to only one ref per macroblock.
patch mostly by Alex Wright (alexw0885 at hotmail dot com).
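
The difference between one reference per macroblock and mixed references can be illustrated with a cost table. In this sketch `costs[p][r]` is the cost of coding partition `p` from reference frame `r`; the table, the function names, and the omission of the bits needed to signal each reference index are simplifications for illustration.

```python
def best_shared_ref(costs):
    """One reference for the whole macroblock: minimise the summed cost."""
    n_refs = len(costs[0])
    totals = [sum(part[r] for part in costs) for r in range(n_refs)]
    ref = min(range(n_refs), key=totals.__getitem__)
    return [ref] * len(costs), totals[ref]

def best_mixed_refs(costs):
    """Mixed refs: each partition independently picks its cheapest reference."""
    refs = [min(range(len(part)), key=part.__getitem__) for part in costs]
    total = sum(part[r] for part, r in zip(costs, refs))
    return refs, total
```

The mixed total can never be higher than the shared one in this simplified model, at the price of searching every reference for every partition.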

cosmetics in option parsing

expose the rest of the VUI flags.
patch by Christian Heine.

* common/amd64/mc-a.asm: use RIP-relative addressing in PIC mode.

temporal predictors for 16x16 motion search.

slightly faster/cleaner block_residual_write_cabac


cli: fix a crash on piped input.

stats summary: separately report all 5 partition sizes, and add ref usages

disposable frames shouldn't get their own coded_frame_num.

typo in ia32 x264_pixel_avg_weight_w8_mmxext

mmx avg (already existed but not used for bipred)
mmx biweighted avg (3x faster than C)

cosmetics: move avg function ptrs from pixf to mc.

with B-pyramid, forget old refs in POC order instead of coded order.
(before, b_skip was unavailable with pyramid and ref=1)

typo in r296.
patch by lurui.

* common/amd64/*.asm: use RIP-relative addressing in PIC mode.

* common/amd64/mc-a.asm: removed useless global variables

* configure: support extra $(ASFLAGS) through --extra-asflags.

reorganized VfW UI.
patch by Antony Boucher, graphic by Jarod.

MP4 output: update to GPAC 0.4 API.
patch mostly by Robert Swain.

faster mmx quant 15bit, and add 16bit version. total speedup: ~0.3%
patch by Christian Heine.

faster mmx satd. *x16: 20%, *x8: 10%, total: 2-4%.
ia32 patch by Christian Heine, amd64 port by me.

allow i4x4 and i8x8 down-left prediction with emulated top-right samples.
based on a patch by Johannes Reinhardt (Johannes dot Reinhardt at uni-konstanz dot de)

* configure: added support for ia64, mips/mipsel, m68k, arm, s390 and hppa
platforms, as well as linux sparc.

MMX quantization functions, and optimization of the C versions.
about 3x faster quant_8x8, quant_4x4, quant_4x4_dc, and quant_2x2_dc. total speedup: 4-10%.
patch by Alexander Izvorski and Christian Heine.

SSE2 pixel comparison functions
P4: SAD 16x*, SSD 16x*, SATD 16x*: 30% faster, SATD 8x8: 15% faster, total: 2-4% faster
K8: SSD 16x*: 6% faster, total: not much
patch by Alexander Izvorski.

10l in rev290: duplicate declaration of x264_pixel_sub_8x8_mmx.

mmx 8x8 dct.
On a K8: sub16x16_dct8 3806->1461, add16x16_idct8 4852->1297 cycles. total speedup: 1-3%.
patch by Christian Heine (sennindemokrit at gmx dot net)

VC++ fix (thx fenrir)

x264.h: issue an explicit warning when neither stdint.h nor inttypes.h
has been included before x264.h

VfW: SAR wording. patch by Sharktooth.

cli: workaround to allow "--ratetol inf" on win32.

* all: Patch by Mike Matsnev:
"The following things were fixed:
* AR calculation was broken on previous import
* Wrong conditional in write_nalu_mkv() was fixed
* Error checking was added in all places"

xyuv: bug fixes + autodetect of video size.

Run ranlib after make install (OS X needs that)

update i_mb_b16x8_cost_table[] for I8x8 mb type (r278 only fixed a symptom).

* all: Added matroska writing. Patch by Mike Matsnev.

* pixel.*:
"I have completed additional SAD implementations (8x16, 16x8 and 16x16)
using Sparc VIS. Overall speedup is roughly 90% from straight C. I'm
doing development and testing on a Sun Fire V220, with 2 * 1.5ghz
I've hand-unrolled each of the loops. Sun's assembler does not appear
to have macro functionality built-in and I didn't want to establish an
external dependency on m4. Please let me know if you run into any
trouble with the patch."
Patch by Phil Jensen.

analyse: "It correct the size of array i_mb_b16x8_cost_table
from 16 to 17, otherwise, it can result a mismatch of b16x8
mb type cost and can result memory read overflow on it." Patch by lurui.

* x264 compilation on NetBSD. Patch by Mike Matsnev.

* all: "8x8 SAD written in Sparc Assembly using VIS." Patch by Phil Jensen.

10l: rd score for sub-8x8 partitions used wrong mvs.

faster SAD_INC_2x16P for amd64.
patch by Josef Zlomek.

Fixed win32 handle leakage (thanks Trax)
Default enabled support of threads on BeOS

* Add support for UltraSparc (uname -m: sun4u) with Solaris.
Patch by Tuukka Toivonen.

* Faster SAD_INC_2x16P. Patch by Alexander Izvorski.

example quant matrix file

--cqmfile reads quant matrices in a JM-compatible format.

adjust coded buffer size based on input resolution and QP (old default wasn't enough for HD lossless)

update avc2avi for high profile

custom quant matrices

VfW: workaround a windows unicode bug.
patch by Leowai.

lossless mode enabled at qp=0

VfW: enable RDO. some option dependencies.
patch by Francesco Corriga.

rate-distortion optimized MB types in I- and P-frames (--subme 6)

more VfW options.
patch mostly by celtic_druid.

VFW: 8x8 transform, SAR.
patch by celtic_druid.

threads option in vfw.
patch by celtic_druid.

win32 threads enabled by default

vfw installer nsis script.
patch by Francesco Corriga.

print 8x8 transform usage % in stats summary.

revert 216, another try at max_dec_frame_buffering.
disable adaptive cabac_idc by default; 0 is always best anyway.

typo in cabac tables


fix i8x8 decision with chroma_me

SATD-based decision for 8x8 transform in inter-MBs.
Enable 8x8 intra.
CLI options: --8x8dct, --analyse i8x8.

Use win32 native threads (you still have to --enable-pthread to use
them, though)

slightly faster 8x8 dct

remove unused tables from SPS/PPS. reduces overhead when syncing threads.

10l (debug stuff in 246)

8x8 transform and 8x8 intra prediction.
(backend only, not yet used by mb analysis)


fix a bug with cabac + B-frames + mref + slices.
call visualization per frame instead of per slice.

accept the standard --prefix etc. options

tweak cflags

Fixed multithreading on BeOS (pthread emulation required)

multithreading (via slices)

move zones parsing to ratecontrol.c; allows passing in zones as a string.

UMHex motion search (but no early termination yet)

Zoned ratecontrol.

fix rounding of intra dequant when qp<=3

API: x264_encoder_reconfig(). (not yet used by any frontend)

Makefile: in target "install", first create the directories if they
don't already exist

Optimized subXxX_dct


ppc/: compile fixes for Linux/PPC (courtesy of Rasmus Rohde) and
for gcc < 4

visualize reference pic numbers. misc cleanups in visualization.
patch by Tuukka Toivonen.

ppc/*: more tuning on satd (+5%)

CLI option: --seek

CLI option: --visualize
Displays the encoded video along with MB types and motion vectors.
patch by Tuukka Toivonen.

fix an uninitialized value in slicetype_analyse

port recent MC asm changes to amd64.
patch by Josef Zlomek.

+ Removed unused code
+ Optimized mc chroma 4xH and satd 8x4 and 4x8
+ Won a bunch of cycles by not trusting gcc about inlining and
unrolling properly
(about 17% faster globally)

New ratecontrol options:
1pass ABR. VBV constraint for ABR and 2pass.
There is no longer a dedicated CBR mode: use ABR+VBV.
VfW now uses ABR instead of CQP for 1st of multipass.

use a predicted mv as starting point for subpel refinement.

slight speedup in halfpel interpolation.
patch by Mathieu Monnier.

Cleaner allocation of tmp space in halfpel interpolation; fixes some valgrind/nasm warnings.
patch by Mathieu Monnier.

"2pass failed to converge" is no longer considered fatal.

Updated MSVC project files.
thanks to Bonzi.

silence some gcc warnings.
amd64 doesn't need a separate copy of the c/h files, only the asm.

10l (214 wrote wrong DPB size in SPS -> B-pyramid broke)

CLI (mp4): return to 'capture' output mode, remove useless SetCtsPackMode() (fixed in gpac).
Note: requires gpac cvs-20050419 or later.
patch by bobo.

combined L0 & L1 reference lists are limited to a total of 16 pics.

amd64 asm patch, part2.
by Josef Zlomek ( josef dot zlomek at xeris dot cz )

amd64 asm patch, part1.

Allow manual selection of fullpel ME method. New method: Exhaustive search.
based on a patch by Tuukka Toivonen.

misc makefile changes.
propagate --extra-cflags to vfw.
'make clean' removes x264.exe and vfw.
tweak dependencies.

10l (CLI: fflush after progress update)

CLI: progress indicator

VfW: build from main makefile

[mp4] ftyp & moov boxes at the beginning of the file, (thanks to jeanlf
for comments)
patch by bobololo

CLI: --fps had side-effects. fixed.

CLI: cosmetics

Makefile: strip x264cli.
tweak stats summary.

* x264.c: Fix ctts box creation. Patch by bobololo from Ateme.

common/ppc: more cleaning, optimized a bit

CLI: require output file (don't default to stdout). warn if trying to use mp4 or avis when not supported. misc cleanup.

configure: use -falign-loops=16 on OS X
common/ppc/: added AltiVecized mc_chroma + cleaning
checkasm.c: really fixed MC tests

Configure tweaks. Allow avis-input in mingw. Turn off debug by default.

checkasm.c: fixed MC tests

CLI: MP4 muxing.
patch by bobo from Ateme.

Cygwin fixes

configure: ooops, restored -g
ratecontrol.c: OS X has exp2f in -lmx
checkasm: quick compile fix

add x86_64 to configure

set svn:ignore

Added a configure to detect the platform/system/etc so people don't
have to edit the Makefile (will work for Linux/OS X/BeOS/FreeBSD, feel
free to modify for others), and we can now remove the Jamfile which
was broken most of the time anyway.

Makefiles: better dependencies for SEI version number

Forgot rbsp_trailing_bits in AUD NAL

Optionally use access unit delimiter NAL units.

VfW: cleaner install on win98.
patch by Riccardo Stievano.

new util: countquant for 2pass statsfiles

print svn version number in SEI info and in CLI/VfW.

Make reconstructed frame available to caller.

make install

free() -> x264_free()

CLI: flush B-frames at the end of the encode

convert mc's inline asm to nasm (slight speedup and msvc compatibility).
patch by Mathieu Monnier.

buffer overruns in slicetype_decision.
patch by Mathieu Monnier.

tweak usage message

Simplify inter analysis option names. (psub16x16 -> p8x8)
patch by Robert Swain.

173 broke .depend when debugging was enabled

early termination for intra4x4 analysis

Check/fix range of x264_param_t.rc.i_qp_constant.

Cleaned up and fixed Makefile for OS X and BeOS (hopefully FreeBSD too)
It defaults to x86/Linux; for other platforms, uncomment the lines for your
platform & OS at the beginning of the Makefile

macroblock_analyse: simplify cost comparisons. (cosmetic)
CLI: enable cabac by default.

Chroma ME (P-frames only).

SSE optimized chroma MC.
patch by Radek Czyz.

167 broke psnr calculation for non-mod-32 inputs

sqrtf requires -lmx on Mac OS X

use mmx ssd for psnr calculation.

revert 164. blame Spyder.

SSD comparison function (not yet used).
Cosmetics in mmx SAD.

VfW: reject YUY2 and RGB input formats

Really fix QP override.

write VUI bitstream restrictions

AVI & Avisynth input (win32 only).
patch by bobo from Ateme.

expose option "chroma qp offset"

Fix per-frame QP override broken in rev 137.

Don't include x264.o in the library.

VfW: expose B pyramid and weighted B prediction.
patch by Riccardo Stievano.


buffer overrun when bframes == X264_BFRAME_MAX

Adaptive B skipped some POC numbers (slightly reducing b_direct efficiency).

Use POC to determine frame boundaries (frame_num couldn't distinguish consecutive B-frames).
Fix keyframe flag to mark IDR only, not all I slices.

allow 16 refs (instead of 15)

report version number in decimal instead of hex

New option: "B-frame pyramid" keeps the middle of 2+ consecutive B-frames as a reference, and reorders frame appropriately.
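The reordering described in this entry can be pictured with a minimal sketch. It is not x264's code: the function name, the frame labels ("P", "B-ref", "b"), and the choice of which frame counts as the middle of an even-sized run are all assumptions made for illustration.

```python
def pyramid_coding_order(first_b_index, num_b):
    """Return a hypothetical coding order for one run of B-frames.

    The run consists of num_b consecutive B-frames in display order,
    starting at first_b_index, followed by one P-frame anchor.
    """
    b_indices = list(range(first_b_index, first_b_index + num_b))
    anchor = first_b_index + num_b

    # The anchor must be coded first, because the B-frames predict from it.
    order = [(anchor, "P")]

    if num_b >= 2:
        # Assumption: for an even-sized run, take the upper of the two
        # middle frames. The real encoder may choose differently.
        middle = b_indices[num_b // 2]
        order.append((middle, "B-ref"))
        order += [(i, "b") for i in b_indices if i != middle]
    else:
        # A single B-frame has no "middle" to keep as a reference.
        order += [(i, "b") for i in b_indices]
    return order


if __name__ == "__main__":
    # Display order: 1 2 3 are B-frames, 4 is the P-frame anchor.
    print(pyramid_coding_order(1, 3))
```

With three B-frames, the middle one is coded right after the anchor so that the two outer B-frames can use it as a reference.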

smarter parsing of resolution from commandline

ratecontrol.c: fixed exp2f on BeOS so rate control works properly

Fix a buffer overrun with very long MVs.

wrong stride in lowres image

10l (fast1stpass was slower than non-fast)

Disable deblocking filter in frames of sufficiently low QP that it would have no effect. (Saves a little CPU time in the decoder.)

Simplify x264_frame_expand_border.

Altivec functions for MC using the cached halfpel planes.
Patch by Fredrik Pettersson <fredrik_pettersson at yahoo dot se>.

Don't use uninitialized MVs in x264_mb_predict_mv_ref16x16.

Implicit weights in B16x16 analysis were swapped.
patch by Radek Czyz.

Cosmetics: Some renaming. Move the rest of slice type decision from encoder.c to slicetype_decision.c

Take into account keyint_max in B-frame decision.

Preliminary adaptive B-frame decision (not yet tuned).
Fix flushing of delayed frames when the encode finishes.

Write x264's version in a SEI message.

VfW: Enable weighted B prediction when max B-frames > 1. Enforce max reference frames <= 15.
patch by Riccardo Stievano.

Add: implicit weighted prediction for B-frames.
Slightly optimize x264_mb_mc_01xywh.
Fix an error in B16x8 cost.

Oops, increment API number.

Configurable level. Levels are still not enforced; it's up to the user to select a level compatible with the rest of the encoding options.
Patch by Jeff Clagg <snacky at ikaruga dot co dot uk>.

Always use the tempfile and rename method for multipass stats, so that VfW knows whether the previous pass completed.

More tweaks to bitrate prediction.
Change error messages when 2pass fails to converge.

Improved 2pass bitrate predictor. No real change most of the time, but allows correct ratecontrol on some pathological videos that used to diverge completely. Also improves prediction when 2nd pass bitrate is very different from 1st pass.
The new qscale2bits() has no simple inverse, so I also had to change rc_eq to output qscale instead of bits.

Some defines needed by MSVC, and convert the DSP files to DOS-style newlines.
Patch by Radek Czyz.

Precalculate lambda*bits for all allowed mvs. 1-2% speedup.
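A minimal sketch of the idea behind this entry, assuming motion vector differences are costed with the signed Exp-Golomb code length (the encoder's real table layout and bit estimate may differ). The names `se_bits`, `build_mv_cost_table`, and `mv_cost` are hypothetical.

```python
def se_bits(value):
    """Length in bits of the signed Exp-Golomb code for an integer."""
    # Signed values map to an unsigned code number: 0, 1, -1, 2, -2, ...
    code_num = 2 * value - 1 if value > 0 else -2 * value
    # Unsigned Exp-Golomb length is 2 * floor(log2(code_num + 1)) + 1.
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def build_mv_cost_table(lam, max_mv):
    """Precompute lam * bits for every allowed mv difference component."""
    return {d: lam * se_bits(d) for d in range(-max_mv, max_mv + 1)}


def mv_cost(table, mv, predicted_mv):
    """Look up the cost of coding mv relative to its predictor."""
    dx = mv[0] - predicted_mv[0]
    dy = mv[1] - predicted_mv[1]
    return table[dx] + table[dy]


if __name__ == "__main__":
    table = build_mv_cost_table(lam=4, max_mv=16)
    print(mv_cost(table, mv=(3, -1), predicted_mv=(1, -1)))
```

The motion search then does two table lookups per candidate instead of recomputing the code length and the multiplication each time.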

Deblock B-frames. (Not yet used, since B-frames aren't kept as references.)

Simplify x264_mb_mc_01xywh()

Save some memcopies in halfpel ME.
Patch by Radek Czyz.

Cache half-pixel interpolated reference frames, to avoid duplicate motion compensation.
30-50% speedup at subq=5.
Patch by Radek Czyz.

In N-pass mode if stat_in and stat_out are the same file, instead save to a temp file and overwrite stat_in only when the encode finishes.
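The temp-file-then-rename pattern in this entry can be sketched as follows. This is an illustration in Python, not the encoder's C code; the function name and the `.temp` suffix are assumptions.

```python
import os
import tempfile


def write_stats_atomically(path, lines):
    """Write lines to a temp file, then replace path in one step.

    The old file at path stays readable and intact until the new
    contents are completely written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # does not cross file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".temp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # An interrupted pass leaves the previous stats file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the encode is interrupted, the stats from the previous pass are still usable, which is the property the entry is after.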

VfW: x264_log now creates a window for error messages


bs_align_1() didn't actually write all ones. (so encoded streams with cabac were technically invalid, though no decoder cares.)
Patch by Tuukka Toivonen.

VfW: tweak option names

VfW: use separate stats files for each pass of an N-pass encode.

VfW: Enable multipass by default, increase the configurable range of I and B quant ratios.
core: Tweak error messages.

r114 didn't completely fix the problem, trying again.

Another MV clipping fix.

Simplify x264_cabac_mb_type.

More accurate clipping rectangle for motion search. (slight compression improvement for high-motion scenes)

encoder/encoder.c: gcc < 3 compile fix

Change default level from 2.1 to 4.0 until I get around to calculating actual levels.

Clipping mvs to within picture + emulated border when running motion compensation.

Fix clipping of mvs in probe_pskip. (Previously it mixed up fullpel with qpel.) This should eliminate the black blocks that sometimes appeared in high motion, low detail scenes.

Fix length of strings stored in the registry.
Patch by Riccardo Stievano.

registry values for min/max keyint were mixed up

VfW: expose option "Nth pass" (i.e. simultaneously read and update the multipass stats file).
Patch by Riccardo Stievano.

add "make NDEBUG=1" to strip library

finish subpixel motion refinement for B-frames (up to 6% reduced size of B-frames at subq <= 3)

VfW: expose the 2pass ratecontrol option: qcomp ("bitrate variability").
Some rearranging of the advanced configuration dialogue.
Patch by Riccardo Stievano <walkunafraid at tin dot it>.

VfW: Support ip_factor and pb_factor, some cleanups.
patch by Riccardo Stievano <walkunafraid at tin dot it>

Use floats instead of int64 in log messages, since win32 (incl. mingw) doesn't understand %lld.
Also display MB statistics in percent instead of number.

finished printf -> x264_log conversion.

Don't apply keyframe boost to I-frames that are followed by another I.

New VfW option: "fast 1st pass" automatically disables some partitions and reduces ME quality and number of reference frames.
Removed option direct_pred=none, since it provides no benefits.
Patch by Riccardo Stievano <walkunafraid at tin dot it>.

vfw: tweak wording and defaults

From Riccardo Stievano <walkunafraid at tin dot it>:
here's a patch that fixes the VfW frontend after the changes made in
revision 93 (GOP size management). Default values for i_keyint_max
and i_keyint_min have been set to 250 and 10, respectively.

My last change of IDR decision broke in 2pass mode. fixed by remembering which frames are IDR.
Disable benchmarking, as it was very slow for some people, and we already know that all the time is spent in macroblock analysis.

Changes the mechanics of max keyframe interval:
Now enforces min and max GOP sizes, and allows variable numbers of
non-IDR I-frames within a GOP.
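One way to read this entry, as a hedged sketch: a scene cut that arrives before the minimum GOP size becomes a plain I-frame, while one that arrives later starts a new GOP with an IDR, and the maximum GOP size forces an IDR regardless. The function name and the exact comparisons are assumptions; the defaults 250 and 10 are the values mentioned in the VfW entry above.

```python
def assign_frame_types(num_frames, scenecuts, keyint_min=10, keyint_max=250):
    """Assign "IDR", "I", or "P" to each frame index (B-frames ignored)."""
    types = []
    last_idr = None
    for n in range(num_frames):
        if last_idr is None or n - last_idr >= keyint_max:
            # First frame, or the GOP has reached its maximum size.
            frame_type = "IDR"
        elif n in scenecuts:
            # A scene cut too close to the last IDR gets a non-IDR I-frame.
            frame_type = "IDR" if n - last_idr >= keyint_min else "I"
        else:
            frame_type = "P"
        if frame_type == "IDR":
            last_idr = n
        types.append(frame_type)
    return types


if __name__ == "__main__":
    types = assign_frame_types(300, scenecuts={5, 40})
    print([(n, t) for n, t in enumerate(types) if t != "P"])
```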

MinGW compatible resource.rc by Radek Czyz

strict QP offset for B-frame vs following P-frame
strict QP offset for I-frame vs GOP average

r72 broke B-frames without intra4x4. fixed.

updated VfW interface by Radek Czyz

improved mv prediction: 1-3% better compression of B-frames
early termination for B-frame ref search: up to 20% faster with lots of refs.

allow constant qp on Nth pass (e.g. for forcing frame types)

disable subme=0 (the huge bitrate penalty wasn't worth the speed)
renumber direct_pred

oops, last patch had some debug statements

fix: "x264 -A all" didn't include b8x8 types.
add: "make NDEBUG=1" to strip library
update TODO with B-frame status

Reorganize frame type selection: No longer produces consecutive I-frames when B-frames are enabled. Not thoroughly tested, but works for me.
Fix scenecut detection when B-frames are present: Can now produce IDR, but is slower since it re-encodes more frames. This might reduce compression ratio in the presence of quick fade-ins.
2pass ratecontrol deals more gracefully with completely skipped frames.

remove Makefile.cygwin because build/cygwin/Makefile is more up to date.
put correct object file names in .depend

reduce default verbosity, add option -v

remove relative include paths, to avoid conflicts with libtool

rename *.asm to avoid conflicts with libtool

list default settings in --help

replace EPZS diamond with a hexagon search pattern.
early termination for multiple reference frame search (up to 1.5x faster).
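For readers unfamiliar with hexagon search, here is a minimal sketch of the general technique. It is not x264's implementation: the function name, the step limit, and the final 4-point refinement are assumptions, and the early termination for multiple reference frames is not modelled.

```python
# Six points of a large hexagon around the current centre.
HEXAGON = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]
# Four direct neighbours used for the final refinement step.
SMALL = [(0, -1), (-1, 0), (1, 0), (0, 1)]


def hexagon_search(cost, start=(0, 0), max_steps=16):
    """Find a low-cost integer motion vector using a hexagon pattern.

    cost is a callable taking (x, y) and returning a number.
    """
    best = start
    best_cost = cost(*best)

    for _ in range(max_steps):
        center = best
        for dx, dy in HEXAGON:
            candidate = (center[0] + dx, center[1] + dy)
            candidate_cost = cost(*candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
        if best == center:
            # No hexagon point improved on the centre: stop moving.
            break

    center = best
    for dx, dy in SMALL:
        candidate = (center[0] + dx, center[1] + dy)
        candidate_cost = cost(*candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


if __name__ == "__main__":
    # Toy cost surface with its minimum at (5, -3).
    print(hexagon_search(lambda x, y: abs(x - 5) + abs(y + 3)))
```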

sps->i_num_ref_frames was set higher than necessary

new option: --fps

various cleanups in macroblock caching.
store motion data for each reference frame (but not yet used).

more accurate cost for psub8x8 modes.

implement macroblock types B_16x8, B_8x16
tweak thresholds for comparing B mb types

simplify x264_mb_predict_mv_direct16x16_temporal

option '--frames' limits number of frames to encode.
patch by Tuukka Toivonen <tuukkat at>

simplify cavlc mb type

implement macroblock types B_SKIP, B_DIRECT, B_8x8

rename 'core/' to 'common/', which avoids conflicts with libtool

cleanup stats reporting
report B macroblock types
report average QP

apply ip_factor and pb_factor in constant quantiser encodes.

save a little bit of memory

multiple hypothesis mv prediction:
1-3% improved compression, and .5-1% faster

* analyse: we can do 4x4 Horizontal Up mode when LEFT is available.
improved 2pass ratecontrol:
ensures that I-frames have comparable quantizer to the following P-frames,
and produces more consistent quality in areas of fluctuating complexity.

more informative error message when 2pass fails to converge

#include <stdarg.h>

cleanup spacing of frame stats with verbose logging.

typo in x264_cabac_mb_sub_b_partition
(see ITU-T H.264 clause


+ No need to emulate memalign on OS X
+ Fixed Makefile for OS X
(Original patch by Peter Handel)

Conditionally inits 1pass rc, only if it's enabled.
This prevents a couple of irrelevant warnings from appearing in
constant QP mode. (Loren Merritt <lorenm at u dot washington dot edu>)

Oops, changing those types messed up some vprintf's. fixed.
(Loren Merritt <lorenm at u dot washington dot edu>)

filesize (bits) in a 32 bit int will overflow after 250MB, screwing up
2pass ratecontrol.
(patch by Loren Merritt <lorenm at u dot washington dot edu>)
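The arithmetic behind this entry, as a short worked example. A signed 32-bit counter of bits tops out at 2^31 - 1, which is just under 256 MiB of data, in line with the rough "250MB" figure quoted. The helper `wrap_int32` is hypothetical and only simulates two's-complement wraparound.

```python
INT32_MAX = 2**31 - 1

# Largest file size, in bytes, whose bit count still fits in a signed int32.
max_bytes = INT32_MAX // 8
max_mib = max_bytes / 2**20


def wrap_int32(value):
    """Simulate storing an integer in a signed 32-bit variable."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(f"limit: {max_bytes} bytes, about {max_mib:.1f} MiB")
    # A 300 MB first pass expressed in bits no longer fits.
    bits = 300 * 10**6 * 8
    print(f"{bits} bits stored as int32: {wrap_int32(bits)}")
```

Once the counter wraps negative, any bitrate derived from it is meaningless, which is how the second pass ends up with a nonsensical target.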

fix compilation on FreeBSD (from Loren Merritt (thanks to Igla))

* ratecontrol: Patch by Loren Merritt :
" This patch
* calculates average QP as a float, providing slightly improved
ratecontrol if the first pass was CBR.
* fixes the reported QP if you set both b_stat_read and b_stat_write,
allowing 3 pass encoding (or just examination of the 2nd pass's stats)."

* all: Patch by Loren Merritt.
" This patch makes scene-cut detection based on the relative cost of I-frame
vs P-frame, rather than just on the number of I-blocks used.
It also makes the scene-cut threshold configurable.
This doesn't have a very large effect: Most scene cuts are obvious to
either algorithm. But I think this way is better in some less clear cut
cases, and sometimes finds a better spot for an I-frame than just waiting
for the max I-frame interval."

* ratecontrol: added 'b' flag to fopen.

* all: Patches by Loren Merritt:
"Improved patch. Now supports subpel ME on all candidate MB types,
not just on the winner.
subpel_refine: (completely different scale from before)
0 => halfpel only
1 => 1 iteration of qpel on the winner (same as x264 r46)
2 => 2 iterations of qpel (about the same as my earlier patch, but faster)
3 => halfpel on all MB types, qpel on the winner
4 => qpel on all
5 => more iterations
mencoder dvd://1 -ovc x264 -x264encopts
subpel_refine=1:PSNR Global:46.82 kb/s:1048.1 fps:17.335
subpel_refine=2:PSNR Global:46.83 kb/s:1034.4 fps:16.970
subpel_refine=3:PSNR Global:46.84 kb/s:1023.3 fps:14.770
subpel_refine=4:PSNR Global:46.87 kb/s:1010.8 fps:11.598
subpel_refine=5:PSNR Global:46.88 kb/s:1006.9 fps:10.824"
"The current code for calculating the cost of encoding which reference
frame a MB is predicted from, introduces a bias towards ref0 and
against P16x16.
Removing this bias produces an improvement of .4% - 2% bitrate,
depending on content and number of reference frames."
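To make the trade-off in the quoted subpel_refine figures easier to see, the sketch below parses those five result lines and expresses each setting relative to subpel_refine=1. The numbers are copied from the entry above; the function names and the regular expression are illustrative assumptions.

```python
import re

RESULTS = """\
subpel_refine=1:PSNR Global:46.82 kb/s:1048.1 fps:17.335
subpel_refine=2:PSNR Global:46.83 kb/s:1034.4 fps:16.970
subpel_refine=3:PSNR Global:46.84 kb/s:1023.3 fps:14.770
subpel_refine=4:PSNR Global:46.87 kb/s:1010.8 fps:11.598
subpel_refine=5:PSNR Global:46.88 kb/s:1006.9 fps:10.824
"""

PATTERN = re.compile(
    r"subpel_refine=(\d+):PSNR Global:([\d.]+) kb/s:([\d.]+) fps:([\d.]+)"
)


def parse_results(text):
    """Return a list of (level, psnr, kbps, fps) tuples."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in PATTERN.finditer(text)
    ]


def relative_to_baseline(rows):
    """Map level -> (bitrate saving in percent, speed as fraction of baseline)."""
    _, _, base_kbps, base_fps = rows[0]
    return {
        level: (100.0 * (base_kbps - kbps) / base_kbps, fps / base_fps)
        for level, _psnr, kbps, fps in rows
    }


if __name__ == "__main__":
    for level, (saving, speed) in relative_to_baseline(parse_results(RESULTS)).items():
        print(f"subpel_refine={level}: {saving:.2f}% smaller, {speed:.2f}x speed")
```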

* x264: added --ipratio --pbratio in help section.

* ratecontrol: patch by Loren Merritt.
"Use average qp instead of last qp in the frame for 2pass rc.
(Improves quality and rate accuracy if the first pass was cbr.)"

* x264: added --quiet and --no-psnr.

* eval.c: lalala ;)

* added Loren Merritt.

* all: added eval.c (I hope libx264.dsp is correct, I can't test).

* all: 2pass patch by Loren Merritt <lorenm AT u.washington DOT edu>
"Mostly borrowed from libavcodec.
There is not much theoretical basis behind my choice of defaults for
rc_eq, qcompress, qblur, and ip_factor."

* all: first part of the 2pass patch by Loren Merritt
(only the header/textures bits computed for now).

* all: include stdarg.h (needed for x264_log)

Use x264_log() in ratecontrol.c

* encoder/encoder.c: oops. (fixed compilation).

* all: more fprintf -> x264_log.

* all: added a x264_param_t.analyse.b_psnr

* encoder/encoder.c: kb/s with k=1000 (more consistent). Patch by Loren
Merritt <lorenm AT u DOT washington DOT edu>
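A small worked example of what k=1000 means for the reported bitrate. The function name and the sample figures are made up for illustration.

```python
def bitrate_kbps(total_bytes, num_frames, fps):
    """Average bitrate in kb/s, with k = 1000."""
    duration_seconds = num_frames / fps
    return total_bytes * 8 / duration_seconds / 1000.0


if __name__ == "__main__":
    # 1,000,000 bytes over 200 frames at 25 fps is 8 seconds of video.
    print(bitrate_kbps(1_000_000, 200, 25))
    # The same stream with k = 1024 would read about 2.3% lower.
    print(1_000_000 * 8 / 8 / 1024.0)
```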

* all: introduced a x264_log function. It's not yet used everywhere
but we should start using it :)

OS X is missing exp2f()

Add my svn user name.


Include timing info in VUI.
Change frame rate from float to fraction (sorry for the inconvenience).

Add TAGS rule.

Fixes by Loren Merritt (lorenm at

Get rid of integer overflows that caused the rate control to go
haywire in some situations.

* encoder: correct range for i_idr_pic_id is 0..65535
(Not 0..65534)

ratecontrol: patch by Loren Merritt <lorenm AT u DOT washington DOT edu>
"The new cbr mode fails to completely disable itself when encoding in
constant QP mode. The per-block QPs are then randomized between QP+4 and
QP-2 based on uninitialized ratecontrol parameters."

* ratecontrol: patch by Måns Rullgård <mru AT mru DOT ath DOT cx>
"This patch fixes a small bug (divide by 0 possible) in the rate control."

* encoder: simpler scene cut detection (seems better but do not check
size anymore, so need more testing).

* all: Change the way PSNR is computed (based on a patch by Loren
Merritt <lorenmn AT u DOT washington DOT edu>)
Using SQE(DeltaSourceReconstructed) = Sum( delta^2 )
PSNR( SQE, Size ) = -10 * Ln( SQE / 255^2 / Size ) / Ln(10)
Y+U+V : Union of YUV planes.
Now there is
- Mean PSNR : Sum( PSNR( SQE(Y/U/V), Size(Y/U/V) ) ) / TotalFrames
- Average PSNR: Sum( PSNR( SQE(Y+U+V), Size(Y+U+V) ) ) / TotalFrames
- Global PSNR: PSNR( Sum( SQE(Y+U+V) ), Size(Y+U+V)*TotalFrames )
Mean PSNR is used by the JM, and Average/Overall is used on Doom9 for
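The difference between the Average and Global figures defined in this entry can be shown with a minimal sketch. It assumes 8-bit samples and that each frame is passed as one flat list holding all of its Y, U and V samples; the function names are hypothetical.

```python
import math


def sqe(source, reconstructed):
    """Sum of squared differences between two equal-length sample lists."""
    return sum((a - b) ** 2 for a, b in zip(source, reconstructed))


def psnr(squared_error, size):
    """PSNR in dB: -10 * log10(SQE / 255^2 / size)."""
    if squared_error == 0:
        return math.inf  # identical inputs
    return -10.0 * math.log10(squared_error / 255.0**2 / size)


def average_and_global_psnr(frames):
    """frames is a list of (source, reconstructed) pairs of equal size."""
    size = len(frames[0][0])
    errors = [sqe(src, rec) for src, rec in frames]
    # Average PSNR: compute PSNR per frame, then average the dB values.
    average = sum(psnr(e, size) for e in errors) / len(frames)
    # Global PSNR: pool the squared error over all frames, then convert.
    global_psnr = psnr(sum(errors), size * len(frames))
    return average, global_psnr


if __name__ == "__main__":
    source = [10, 20, 30, 40]
    frames = [(source, [11, 20, 30, 40]), (source, [10, 20, 33, 40])]
    print(average_and_global_psnr(frames))
```

Because the logarithm is taken after pooling, the Global figure is pulled down by the worst frames and never exceeds the Average figure when all frames have the same size.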

* x264.h: increased X264_BUILD.

* all: Patch from Måns Rullgård <mru AT mru DOT ath DOT cx>
"Here's a patch that adds some kind of rate control. I suppose it is
by no means perfect, but it's much better than constant quantizer. It
also has a very crude scene change detection that sometimes avoids a
buffer underflow by reencoding oversized P/B frames as I frames."

Linux PPC AltiVec fix

BeOS fixes (no stdint.h, no libm)

Attempt to fix build on Linux PPC

* encoder.c, analyse.c, macroblock: fixed when using a qp per MB.
(Buggy for pskip and mb with null cbp luma and chroma).
* dct*: fixed order of idct.

* cpu.asm: mmh trashing ebp,esi and edi isn't a good idea I fear ;)

* all: fixed sse2 runtime selection.

update & SSE2 support


remove some unused code

support for build checkasm.exe

* build fix (thx xxcd).

* TODO: test.

* vfw/* : oops...

* mc-c.c compilation fix for gcc >= 3.3

re-import of the CVS.


Alternative to x264 Encoder

TMPGEnc Video Mastering Works

TotalCode Studio

Guides and How to's

x264 Encoding Options Explained

Acronyms / Also Known As

x264 cli, x264cli

16 comments, Showing 1 to 16 comments


Posted by . Tool version 0.130.2273 using OS WinXP
Ease of use 9 of 10 Functionality 8 of 10 Value for money 10 of 10 Overall score 9 of 10

Brilliant software.

I use this commandline tool as part of a conversion sequence to turn TV captures into smaller .mp4s for later playback on a "WDTV Live" box.

The X264 commandline can be daunting to figure out initially (examples abound though, just search) but once you have a useful commandline then Bob's your Uncle. eg "my" commandline creates h264 video which is proven fully compatible with the WDTV Live in terms of the "video technical compliance stuff". Happy days.

X264, when combined with FFMPEG to convert audio and with MP4box to mux the video/audio into an .mp4, provides you with capability to create your own (repeatable) custom tailored encodes.


Posted by . Tool version 2044 using OS Windows 7 64-bit
Ease of use 4 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 9 of 10

Extreme compression might be a very good feature for sharing, contrary to my previous comment. Still figuring out quality settings for personal back-up.

Other ripping tools like Xvid4PSP, StaxRip, RipBot264, FairUse Wizard, MEGUI must be updated to this version accordingly.

Posted by . Tool version Version:r1703 using OS Linux
Ease of use 9 of 10 Functionality 9 of 10 Value for money 9 of 10 Overall score 9 of 10

r1703 gives better compression, but the video loses overall sharpness.
It's disappointing.
Target video bit rate: 1 500 Kbps
Actual video bit rate: 817 Kbps (far below the target, resulting in poor quality)

Hope for better improvement in next release.

Posted by . Tool version r1703 using OS Windows 7
Ease of use 9 of 10 Functionality 9 of 10 Value for money 9 of 10 Overall score 9 of 10

Simply the best implementation of H.264 spec.

It is a CLI tool so some patience is required to learn it, otherwise use some great GUI's like
Ripbot, StaxRip, or MeGUI.

Posted by . Tool version 1666 using OS Windows 7
Ease of use 5 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

and my most favorite video encoder.

Thanks for continuous updates.

x264 vfw requires same trends for updates too!

Posted by . Tool version r1613 using OS WinXP
Ease of use 9 of 10 Functionality 9 of 10 Value for money 9 of 10 Overall score 9 of 10

Simply The Best H264 encoder available, no doubt.
Thanks to authors for keeping FREE,
and running a good show of updates.

Posted by . Tool version 1592 using OS WinXP
Ease of use 9 of 10 Functionality 9 of 10 Value for money 9 of 10 Overall score 9 of 10

By far the best H.264 encoder I've ever experienced. It even dwarfed all these commercial products and it's getting better!

Posted by . Tool version 1570 using OS WinXP
Ease of use 5 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

Unless your device does not support H.264 (MPEG-4 Part 10, AVC), there is NO REASON why you should not be using x264. Even if you are not a console god, there are plenty of GUIs that harness the power of this codec implementation.

Posted by . Tool version 1354 using OS Windows 7 64-bit
Ease of use 10 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

I have been an AVI with XviD and MP3 die hard fan for a LONG time! I just recently graduated to using MKV files and building my own chapters. THEN, I discovered that I can encode H264/X264 files and *directly* mux the AC3 audio from a DVD rip into an MKV.

What I did NOT expect was the quality of video at such low bit rates.

I very extensively use my Western Digital WD TV to play the videos I make on. When using XviD to encode 720 videos (1280X720), I *must* run at least 5000kbs to have a decent picture. With H264 (or better yet, x264.exe) the video quality is superior, at only 2500kbs!!!

Now I wish my Creative Zen would support 264, because it is leaps and bounds better than WMV9!!

I love this CLI tool, thanks so much to the author(s)!!! I hope to have a GUI built soon, and have plans on making a GUI tool kit for MKVs! Thanks!

Posted by . Tool version r1097 using OS WinXP
Ease of use 10 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

With this codec,you have DVD-like picture quality on VCD bitrates!
IMO, the future is here, in this codec!

I use it with super encoder to batch convert dvb mpeg2 files of various framesizes. The speed is 1/4 realtime on my core 2 duo 6600.

The VfW version is faster (about 2/5 realtime). You can use it with virtualdub.

An excellent choice for those who like cutting edge solutions, or something with great future in front of it!

Posted by . Tool version cor54 rev600 using OS WinXP
Ease of use 10 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

X264 is the best codec I ever used. Thanks to DeathToSheep for the unofficial VFW version I can stay using it with virtualdub.
I capture with Mainconcept PVR in MPEG2 (quality 32) and convert with VirtualdubMPG to AVI files (X264 -single pass bitrate 800)
With this combination of videotools I can put 13 episodes (50minutes/episode) of my favorite "Aspe murders" soap on 1 DVD and the quality is much,much better than VHS.

Posted by . Tool version 6.00 using OS WinXP
Ease of use 8 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 9 of 10

After being skeptical about AVC H.264 I finally broke down and decided to try it for some iPod movies. The source files were MPEG-1 @ 1856kbps ripped from some high bitrate xVCD's I did years ago, I tried doing these with 3ivX and DivX 6.2.5 and wasn't pleased with the results especially on HDTV, I tried x264 using MeGUI and I am blown away by the 2-pass quality @ 700kbps. Even at a low resolution of 352x144 these movies look good (not great)on a 42" HDTV and the file sizes are quite small. MeGUI is a very powerful program but not exactly for noobs, When the ability for 640x480 iPod resolutions becomes possible this codec will be unstoppable!!

Posted by . Tool version 565 using OS WinXP
Ease of use 7 of 10 Functionality 9 of 10 Value for money 10 of 10 Overall score 9 of 10

What's xvid?? what's divx?? what's wmv??

No way. X264 is the best codec ever.

High quality at low bitrate.

I have used this awesome codec for months. No rival.


Posted by . Tool version 409 using OS WinXP
Ease of use 10 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 10 of 10

Better quality than XVID and a smaller file size. Use the latest FFDshow from Celtic_Druid for playback. Be aware that playback is CPU intensive - Not designed for < 2.0Ghz machines (yet).

Posted by . Tool version 263 using OS WinXP
Ease of use 8 of 10 Functionality 9 of 10 Value for money 10 of 10 Overall score 9 of 10

works now correctly in sony vegas - testing quality against other H264, but so far - this is a winner ..

Posted by . Tool version Revision 261 using OS WinXP
Ease of use 8 of 10 Functionality 10 of 10 Value for money 10 of 10 Overall score 9 of 10
