New: More tapping points for debug image in ccextractor.
New: Add support for tesseract 4.0
Optimize: Remove multiple RGB to grey conversion in OCR.
Fix: Update UTF8Proc to 2.2.0
Fix: Update LibPNG to 1.6.35
Fix: Update Protobuf-c to 1.3.1
Fix: Warn instead of fatal when a 0xFF marker is missing
Fix: Segfault in general_loop.c due to null pointer dereference (case of no encoder)
Fix: Enable printing hdtv stats to console.
Fix: Many typos in comments and output messages
Fix: Ignore Visual Studio temporary project files
New: Add support for non-Latin characters in stdout
Fix: Check whether stream is empty
New: Add support for EIA-608 inside .mkv
New: Add support for DVB inside .mkv
Fix: Added -latrusmap Map Latin symbols to Cyrillic ones in special cases
of Russian Teletext files (issue #1086)
Fix: Several OCR crashes
New: Upgrade libGPAC to 0.7.1.
New: mp4 tx3g & multitrack subtitles.
New: Guide to update dependencies (docs/Updating_Dependencies.txt).
New: Add LICENSE File (#959).
New: Display quantisation mode in info box (#954).
New: Add instruction required to build ccextractor with HARDSUBX support (#946).
New: Added version no. of libraries to --version.
New: Added -quant (OCR quantization function).
New: Python API now compatible with Python 3.
Fix: linux/builddebug: Added non-local directories to the include search path so we don't
require a locally compiled tesseract or leptonica.
Fix: Correct -HARDSUBX Bug In CMake, allow build with hardsubx using cmake (#966).
Fix: possible segfaults in hardsubx_classifier.c due to strdup (#963).
Fix: Improve the start and end timestamps of extracted burned in captions (#962).
Fix: Update COMPILATION.md (#960).
Fix: Fixed crash with "-out=report" and "-out=null".
Fix: -nocf not working with OCR'ing (#958).
Fix: segfault in add_cc_sub_text and initialize to NULL in init_encoder (#950).
Fix: ccx_decoders_common.c: Copy data type when creating a copy of the subtitle structure.
Fix: Implicit declaration of these functions throws warning during build (#948).
Fix: ccx_decoders_common.c: Properly release allocated resources on free_subtitle().
Fix: Added a datatype member to struct cc_subtitle - needed so we can properly free all
memory when void *data points to a structure that has its own pointers.
Fix: dvb_subtitle_decoder.c: When combining image regions verify that the offset is
Fix: Updated travis.yml to fix osx build (#947).
Fix: Add utf8proc src file to cmake, updated header file (#944).
Fix: Added required pointers on freep() calls.
Fix: Removed dvb_debug_traces_to_stdout and used the usual dbg_print instead.
Fix: Additional debug traces for DVB.
Fix: Fix minor memory leak in ocr.c.
Fix: Fix issue with displaying utf8proc version.
Fix: Fix failing cmake due to liblept/tesseract header files.
Fix: Added missing \n in params.c.
Fix: builddebug: Use -fsanitize=address -fno-omit-frame-pointer.
Fix: ccx_decoders_common.c: Removed trivial memory leak.
Fix: ccx_encoders_srt.c: Made sure a pointer is non-NULL before dereferencing.
Fix: dvb_subtitle_decoder.c: Initialize pointer members to NULL when creating a structure.
Fix: lib_ccx.c: Initialize (memset 0) structure cc_subtitle after memory allocation.
Fix: Added verboseness to error/warnings in dvb_subtitle_decoder.c.
Fix: dvb_subtitle_decoder.c: Work on passing invalid streams errors upstream (plus some
warning messages) so we can eventually recover from this situation instead of crashing.
Fix: telxcc.c: Currently setting a colour doesn't necessarily add a space even though the
specifications mandate it. (#930).
Fix: dvb_subtitle_decoder.c: Fix null pointer dereference when region==NULL in write_dvb_sub.
Fix: DVB Teletext subtitle incomplete.
Fix: replace all 0xA characters within startbox with 0x20.
Fix: DVB Teletext subtitle incomplete (#922).
Fix: Add missing return value to one of the returns in process_tx3g().
Fix: Typos and other minor bugs.
Fix: Tidy CMakeLists & vcxproj (#920).
Fix: Added m2ts and -mxf to help screen.
Fix: Added MKV to demuxer_print_cfg.
Fix: Added MXF to demuxer_print_cfg.
Fix: "Out of order packets" error had wrong print() parameters.
Fix: Updated Python documentation.
Fix: Fix incorrect path in XML (#904).
Fix: linux build script (non-debug): Don't hide warnings from compiler.
Fix: linux build script (debug): Display which step of the build script we're in.
Fix: Make the build reproducible (#976).
Fix: Remove instance of o1 and o2 from help.
Fix: Colors of DVB subtitles with depth 2 broken due to a missing break.
Fix: CEA-708: Caption loss due to CW command (#991).
Fix: CEA-708: Update patch for windows priority with functions (#990).
New: Preliminary MXF support
New: Added a histogram in one-minute increments of the number of lines in a subtitle.
New: Added Autoconf build scripts for CCExtractor to generate makefiles (mac).
New: Added Autoconf build scripts for CCExtractor to generate makefiles (linux).
New: Added .rpm package generation script.
New: Added build/installation script for .pkg.tar.xz (Arch Linux).
New: Added tarball generation script.
New: Added --analyzevideo. If present the video stream will be processed even if the
subtitles are in a different stream. This is useful when we want video information
(resolution, frame type, etc). -vides now implies this option too.
[Note: Tentative - some possibly breaking changes were made for this, so if you
use it, validate the results]
New: Added a GUI in the main CCExtractor binary (separate from the external GUIs
such as CCExtractorGUI).
New: A Python binding extension so it's possible to use CCExtractor's tools from
New: Added -nospupngocr (don't OCR bitmaps when generating spupng, faster)
New: Add support for file split on keyframe (-segmentonkeyonly)
New: Added WebVTT output from Matroska.
New: Support for source-specific multicast.
New: FreeType-based text renderer (-out=spupng with teletext/EIA608).
New: Upgrade library UTF8proc
New: Upgrade library win_iconv
New: Upgrade library zlib
New: Upgrade library LibPNG
New: Support for Source-Specific Multicast
New: Added Travis CI support
New: Made error messages clearer, less ambiguous
Fix: Prevent the OCR being initialized more than once (happened on multiprogram and
Fix: Makefiles, build scripts, etc... everything updated and corrected for all
Fix: Proper line ending for .srt files from bitmaps.
Fix: OCR corrections using grayscale before extracting texts.
Fix: End timestamps in transcripts from DVB.
Fix: Forcing -noru to cause deduplication in ISDB
Fix: TS: Skip NULL packets
Fix: When NAL decoding fails, don't dump the whole decoded thing, limit to 160 bytes.
Fix: Modify Autoconf scripts to generate tarball for mac from /package_creators/tarball.sh
and include GUI files in tarball
Fix: Started work on libGPAC upgrade.
Fix: DVB subtitle not extracted if there's no display segment
Fix: Heap corruption in add_ocrtext2str
Fix: Bug that sometimes caused -out=spupng to crash
Fix: Checks for text before newlines on DVB subtitles
Fix: OCR issue caused by separated dvb subtitle regions
Fix: DVB crash on specific condition (!rect->ocr_text)
Fix: DVB bug (Multiple-line subtitle; Missing last line)
Fix: --sentencecap for teletext samples
Fix: Crash when image passed into OCR is empty
Fix: Temporarily wrapped the Python API, not production ready yet
Fix: -delay option in DVB
- New: Added FFMPEG 3.0 to Windows build - last one that is XP compatible.
- New: Major improvements in CEA-608 to WebVTT (styles, etc).
- New: Return a non-zero return code if no subtitles are found.
- New: Windows build files updated to Visual Studio 2015, new target platform is 140_xp.
- New: Added basic support of Tesseract 4.0.0.
- New: Added build script for .deb.
- New: Updated -debugdvbsub parameter to get the most relevant DVB traces for debugging.
- New: SMPTE-TT files are now compatible with Adobe Premiere.
- New: Updated libpng.
- New: Added 3rd party (Tracy from archive.org) static linux build script.
- New: Add chapter extraction for MP4 files.
- New: Return code 10 if no captions are found at all.
- Fix: Teletext duplicate lines in certain cases.
- Fix: Improved teletext timing.
- Fix: DVB timing is finally good.
- Fix: A few minor memory leaks.
- Fix: tesseract library file included in mac build command.
- Fix: Bad WTV timings in some cases.
- Fix: Mac build script.
- Fix: Memory optimization in HARDSUBX edit_distance.
- Fix: SubStation Alpha subtitles in bitmap.
- Fix: lept msg severity in linux.
- Fix: SSA, SPUPNG and VTT timing and skipping of subtitles for SAMI and TTML.
- Fix: SMPTE-TT : Added support for font color.
- Fix: SAMI unnecessary empty subtitle when extracting DVB subs.
- Fix: Skip the packet if the adaptation field length is broken.
- Fix: 708 - lots of work done in the decoder. Implementation of more commands. Better timing.
- Fix: Signal handlers.
- New: In Windows, both the OCR and non-OCR binaries are bundled, since the OCR one causes problems due to
dependencies on some systems. So unless you need OCR, just use the non-OCR version.
- New: Added -sbs (sentence by sentence) for DVB output. Each frame in the output file contains a complete
- New: Added -curlposturl. If used each output frame will be sent with libcurl by doing a POST to that URL.
- Fix: More code consistency checking in function names.
- Fix: linux build script now tries to verify dependencies.
- Fix: Mac build script was missing a directory.
- Fix: Duplicate lines in mp4 (specifically affects itunes).
- Fix: Timing in .mp4, timing now calculated for each CC pair instead of per atom.
- Fix: Typos everywhere in the documentation and source code.
- Fix: CMakeLists for build in cmake.
- Fix: -unixts option.
- Fix: FPS switching messages.
- Fix: Removed ugly debug statement with local path in HardsubX.
- Fix: Changed platform target to v120_xp in Visual Studio (so XP is supported again).
- Fix: Added detail in many error messages.
- Fix: Memory leaks in videos with XDS.
- Fix: Makefile compatibility issues with Raspberry pi.
- Fix: missing separation between WebVTT header and body.
- Fix: Stupid bug in M2TS that prevented it from working.
- Fix: OCR libraries dependencies for the release version in Windows.
- Fix: non-buffered reading from pipes.
- Fix: --stream option with stdin.
- New: terminate_asap to buffered_read_opt
- New: Added some TV-show specific spelling dictionaries.
- New: Updated GPAC library.
- New: ASS/SSA.
- New: Capture sigterm to do some clean up before terminating.
- New: Work on 708: Changed DefineWindow behavior, only clear text of an existing window if style has changed.
- New: HardsubX - Burned in subtitle extraction subsystem.
- New: Color Detection in DVB Subtitles
- Fix: Corrected sentence capitalization
- Fix: Skipping redundant bytes at the end of tx3g atom in MP4
- Fix: Illegal SRT files being created from DVB subtitles
- Fix: Incorrect Progress Display
- New: --version parameter for extensive version information (version number, compile date, executable hash, git commit (if appropriate))
- New: Add -sem (semaphore) to create a .sem file when an output file is open and delete it when it's closed.
- New: Add --append parameter. This will prevent overwriting of existing files.
- New: File Rotation support added. The user has to send a USR1 signal to rotate.
- Fix: Issues with files <1 Mb
- Fix: Preview of generated transcript.
- Fix: Statistics were not generated anymore.
- Fix: Correcting display of sub mode and info in transcripts.
- Fix: Teletext page number displayed in -UCLA.
- Fix: Removal of excessive XDS notices about aspect ratio info.
- Fix: Force Flushing of file buffers works for all files now.
- Fix: mp4 void atoms that were causing some .mp4 files to fail.
- Fix: Memory usage caused by EPG processing was high due to many non-dynamic buffers.
- Fix: Project files for Visual Studio now include OCR support in Windows.
- Fix: "Premature end of file" (one of the scenarios)
- Fix: XDS data is always parsed again (needed to extract information such as program name)
- Fix: Teletext parsing: @ was incorrectly exported as * - X/26 packet specifications in ETS 300 706 v1.2.1 now better followed
- Fix: Teletext parsing: Latin G2 subsets and accented characters not displaying properly
- Fix: Timing in -ucla
- Fix: Timing in ISDB (some instances)
- Fix: "mfra" mp4 box weight changed to 1 (this helps with correct file format detection)
- Fix: Fix for TARGET File is null.
- Fix: Fixed SegFaults while parsing parameters (if mandatory parameter is not present in -outinterval, -codec or -nocodec)
- Fix: Crash when input is too small
- Fix: Update some URLs in code (references to docs)
- Fix: -delay now updates final timestamp in ISDB, too
- Fix: Removed minor compiler warnings
- Fix: Visual Studio solution files working again
- Fix: ffmpeg integration working again
- New: Added --forceflush (-ff). If used, output file descriptors will be flushed immediately after being written to
- New: Hexdump XDS packets that we cannot parse (shouldn't be many of those anyway)
- New: If input file cannot be opened, provide a decent human readable explanation
- New: GXF support
- Support for Grid Format (g608)
- Show Correct number of teletext packet processed
- Removed Segfault on incorrect mp4 detection
- Remove xml header from transcript format
- Help message updated for Teletext
- Added --help and -h for help message
- Added --nohtmlescape option
- Added --noscte20 option
- Support to extract Closed Caption from MultiProgram at once.
- CEA-708: exporting to SAMI (.smi), Transcript (.txt), Timed Transcript (ttxt) and SubRip (.srt).
- CEA-708: 16 bit charset support (tested on Korean).
- CEA-708: Roll Up captions handling.
- Changed TCP connection protocol (BIN data is now wrapped in packets, added EPG support and keep-alive packets).
- TCP connection password prompt is removed. To set connection password use -tcppassword argument instead.
- Support ISDB Closed Caption.
- Added a new output format, simplexml (used internally by a CCExtractor user, may or may not be useful for
- Fixed bug in capitalization code ('I' was not being capitalized).
- GUI should now run in Windows 8 (using the included .Net runtime, since 3.5 cannot be installed in Windows 8 apparently).
- Fixed Mac build script, binary is now compiled with support for files over 2 GB.
- Fixed bug in PMT code, damaged PMT sections could make CCExtractor crash.
- Added basic M2TS support
- Added EPG support - you can now export the Program Guide to XML
- Some bugfixes
- Fixed issue with teletext to formats other than srt.
- CCExtractor can be used as library if compiled using cmake
- By default the Windows version adds BOM to generated UTF files (this is because it's needed to open the files correctly) while all other builds don't add it (because it messes with text processing tools). You can use -bom and -nobom to change the behaviour.
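A minimal sketch of what the BOM choice means for a generated UTF-8 file, written in Python. The helper names `write_subtitles` and `has_bom` and the `bom` parameter are hypothetical and are not part of CCExtractor. The sketch only illustrates that the BOM is the three bytes EF BB BF placed at the start of the file.

```python
import codecs


def write_subtitles(path, text, bom):
    """Write text as UTF-8, optionally preceded by a byte order mark.

    Hypothetical helper for illustration only; CCExtractor itself is
    controlled with -bom / -nobom.
    """
    # "utf-8-sig" prepends the BOM on write; plain "utf-8" does not.
    encoding = "utf-8-sig" if bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


def has_bom(path):
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8
```

Text processing tools that compare the first bytes of a file literally will see those three extra bytes, which is why the non-Windows builds leave them out by default.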
- Fixed issue with -o1 -o2 and -12 parameters (where it would write output only in the o2 file)
- Fixed UCLA parameter issue. Now the UCLA parameter settings can't be overwritten anymore by later parameters that affect the custom transcript
- Switched order around for TLT and TT page number in custom transcript to match UCLA settings
- Added nobom parameter, for when files are processed by tools that can't handle the BOM. If using this, files might not be readable under Windows.
- Segfault fix when no input files were given
- No more bin output when sending to server + possibility to send TT to server for processing
- Windows: Added the Microsoft redistributable MSVCR120.DLL to both the installation package and the application zip.
0.73 - GSOC
- Added support of BIN format for Teletext
- Added start of librarisation. This will allow in the future for other programs to use encoder/decoder functions and more.
0.72 - GSOC
- Fix for WTV files with incorrect timing
- Added support for fps change using data from AVC video track in a H264 TS file.
0.71 - GSOC
- Added feature to receive captions in BIN format according to CCExtractor's own protocol over TCP (-tcp port [-tcppassword password])
- Added ability to send captions to the server described above or to the online repository (-sendto host[:port])
- Added -stdin parameter for reading input stream from standard input
- Compilation in Cygwin using linux/Makefile
- Fix for .bin files when not using latin1 charset
- Correction of mp4 timing, when one timestamp points to the timing of two atoms
0.70 - GSOC
This is the first release that is part of Google's Summer of Code.
Anshul, Ruslan and Willem joined CCExtractor to work on a number of things
over the summer, and their work is already reaching the mainstream
version of CCExtractor.
- Added a huge dictionary submitted by Matt Stockard.
- Added DVB subtitles decoder, spupng in output
- Added support for cdt2 media atoms in QT video files. Now multiple atoms in
a single sample sequence are supported.
- Changed Makefile.
- Fixed some bugs.
- Added feature to print info about file's subtitles and streams (-out=report).
- Support Long PMT.
- Support Configuration file.
- There is a sample configuration file in doc/ folder with name
- For now, only a file named ccextractor.cnf kept beside the ccextractor
executable is supported
- for details of which options can be set using configuration file,
please look at sample file.
- Added options for custom transcript output:
new parameter (-customtxt format), where the format must be like this: 1100100 (7 digits).
These indicate whether the next things should be displayed or not in the (timed) transcript:
- Display start time
- Display end time
- Display caption mode
- Display caption channel
- Use a relative timestamp ( relative to the sample)
- Display XDS info
- Use colors
0000101 is the default setting for transcripts
1110101 is the default for timed transcripts
1111001 is the default setting for -ucla
Make sure you use this parameter after others that might affect these
settings (-out, -ucla, -xds, -txt, -ttxt, ...)
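As a sketch of how the 7 digits map to the options listed above, the following Python snippet decodes a format string. The names `CustomTxt` and `parse_customtxt` are hypothetical, and rejecting malformed input with an exception is an assumption made for the example, not a description of how CCExtractor reacts.

```python
from collections import namedtuple

# Field order follows the list in this entry; the Python names are
# illustrative and do not come from the CCExtractor sources.
CustomTxt = namedtuple("CustomTxt", [
    "start_time",
    "end_time",
    "caption_mode",
    "caption_channel",
    "relative_timestamp",
    "xds_info",
    "colors",
])


def parse_customtxt(fmt):
    """Decode a 7-digit -customtxt value into named booleans."""
    if len(fmt) != 7 or any(ch not in "01" for ch in fmt):
        # Assumption: invalid input is rejected outright.
        raise ValueError("expected exactly 7 digits, each 0 or 1")
    return CustomTxt(*(ch == "1" for ch in fmt))
```

For instance, `parse_customtxt("1111001")` (the -ucla default) enables start time, end time, caption mode, caption channel and colors, and leaves the relative timestamp and XDS info off.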
- A few patches from Christopher Small, including proper support for multiple multicast clients listening on the same port.
- GUI: Fixed teletext preview.
- GUI: Added a small indicator of data being received when reading from UDP.
- GUI: Added UTF-8 support to preview Window (used for teletext).
- Fixes in Makefile and build script, compilation in linux and OSX failed if another libpng was found in the system.
- WTV support directly in CCExtractor (no need for wtvccdump any more).
- Started refactoring and clean-up.
- Fix: MPEG clock rollover (happens each 26 hours) caused a time discontinuity.
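The 26-hour figure follows from the MPEG presentation timestamp being a 33-bit counter driven by a 90 kHz clock. The Python sketch below shows one common way to hide the wrap from later processing. It is an illustration under that assumption only, not the code used in CCExtractor, and the names `unwrap_pts` and `rollover_hours` are hypothetical.

```python
PTS_MODULUS = 1 << 33   # PTS is a 33-bit counter
PTS_CLOCK_HZ = 90_000   # incremented at 90 kHz


def rollover_hours():
    """Hours until a 33-bit, 90 kHz counter wraps (about 26.5)."""
    return PTS_MODULUS / PTS_CLOCK_HZ / 3600


def unwrap_pts(values):
    """Yield continuous timestamps from raw 33-bit PTS values."""
    offset = 0
    previous = None
    for raw in values:
        # A backwards jump of more than half the counter range is treated
        # as a wrap; smaller jumps are passed through unchanged.
        if previous is not None and previous - raw > PTS_MODULUS // 2:
            offset += PTS_MODULUS
        previous = raw
        yield raw + offset
```

Using half the counter range as the threshold keeps small backwards steps, such as reordered frames, from being mistaken for a rollover.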
- Windows GUI: Started work on HDHomeRun support. For now it just looks for HDHomeRun devices. Lots of other things will arrive in the next versions.
- Windows GUI: Some code refactoring, since the HDHomeRun support makes the code large enough to require more than one source file :-)
- A couple of shared variables between 608 decoders were causing problems when both fields were processed at the same time with -12, fixed.
- Added BOM for UTF-8 files.
- Corrected a few extended characters in the UTF-8 encoding, probably never used in real world captioning but since we got a good test sample file...
- Color and fonts in PAC commands were ignored, fixed (Heleen Buus).
- Added a new output format, spupng. It consists of one .png file for each subtitle frame and one .xml with all the timing (Heleen Buus).
- Some fixes (Chris Small).
- Padding bytes were being discarded early in the process in 0.66, which is convenient for debugging, but it messes with timing in .raw, which depends on padding. Fixed.
- MythTV's branch had a fixed size buffer that was sometimes not large enough. Made dynamic.
- Better support for PAT changing mid stream.
- Removed quotes in Start in .smi (format fix).
- Added multicast support (Chris Small)
- Added ability to select IP address to bind in UDP (Chris Small)
- Fixes in -unixts and -delay for teletext.
- Added -autodash : When two people are talking, add a dash as needed (this is based on subtitle position). Only in .srt and with -trim. Quite experimental, feedback appreciated.
- Added -latin1 to select Latin 1 as encoding. Default is now UTF-8 (-utf8 still exists but it's not needed).
- Added -ru1, which emulates a (non-existing in real life) 1 line roll-up mode.
- Fixed bug in auto detection code that triggered a message about file being out of sync.
- Added -investigate_packets. The PMT is used to select the most promising elementary stream to get captions from. Sometimes captions are where you least expect them, so -datapid allows you to select an elementary stream manually, in case the CC location is not obvious from the PMT contents. To assist looking for the right stream, the parameter "-investigate_packets" will have CCExtractor look inside each stream, looking for CC markers, and report the streams that are likely to contain CC data even if it can't be determined from their PMT entry.
- Added -datastreamtype to manually select a stream based on its type instead of its PID. Useful if your recording program always hides the captions under the same stream type.
- Added -streamtype so if an elementary stream is selected manually for processing the streamtype can be selected too. This can be needed if you process for example a stream that is declared as "private MPEG" in the PMT, so CCExtractor can't tell what it is. Usually you'll want -streamtype 2 (MPEG video) or -streamtype 6 (MPEG private data).
- PMT content listing improved, it now shows the stream type for more types.
- Fixes in roll-up, cursor was being moved to column 1 if a RU2, RU3 or RU4 was received even if already in roll-up mode.
- Added -autoprogram. If a multiprogram TS is processed and -autoprogram is used CCExtractor will analyze all PMTs and use the first program that has a suitable data stream.
- Timed transcript (ttxt) now also exports the caption mode (roll-up, paint-on, etc) next to each line, as it's useful to detect things like commercials.
- Content Advisory information from XDS is now decoded if it's transmitted in "US TV parental guidelines" or "MPA". Other encoding such as Canada's are not supported yet due to lack of samples.
- Copy Management information from XDS is now decoded.
- Added -xds. If present and export format is timed transcript (only), XDS information will be saved to file (same file as the transcript, with XDS being clearly marked). Note that for now all XDS data is exported even if it doesn't change, so the transcript file will be significantly larger.
- Added some PaintOn support, at least enough to prevent it from breaking things when the other modes are used.
- Removed afd_data() warning. AFD doesn't carry any caption related data. AFD still detected in code in case we want to do something with it later anyway.
- Ported last changes from Petr Kutalek's telxcc. Current version is 2.4.4.
- In teletext mode when exporting to transcript (not .srt), an effort is made to detect and merge line duplicates. This is done by using the Levenshtein distance, which is the number of changes required to convert one string into another. To simplify things, strings are compared up to the length of the shortest one. There are 3 parameters that can be used to tweak the thresholds:
-deblev: Enable debug so the calculated distance for each pair of strings is displayed. The output includes both strings, the calculated distance, the maximum allowed distance, and whether the strings are ultimately considered equivalent or not, i.e. whether the calculated distance is less than or equal to the maximum allowed.
-levdistmincnt value: Minimum distance we always allow regardless of the length of the strings. Default 2. This means that if the calculated distance is 0, 1 or 2, we consider the strings to be equivalent.
-levdistmaxpct value: Maximum distance we allow, as a percentage of the shortest string length. Default 10%. For example, consider a comparison of one string of 30 characters and one of 60 characters. We want to determine whether the first 30 characters of the longer string are more or less the same as the shortest string, i.e. whether the longest string is the shortest one plus new characters and maybe some corrections. Since the shortest string is 30 characters and the default percentage is 10%, we would allow a distance of up to 3 between the first 30 characters.
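The comparison described in this entry can be sketched in Python as follows. The function names are hypothetical, and combining the two thresholds by taking the larger of -levdistmincnt and the -levdistmaxpct percentage of the shortest length is an assumption about how the two parameters interact. The actual implementation may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lines_equivalent(s1, s2, min_cnt=2, max_pct=10):
    """Decide whether two lines count as duplicates.

    min_cnt and max_pct mirror the documented defaults of
    -levdistmincnt (2) and -levdistmaxpct (10).
    """
    # Both strings are compared only up to the length of the shortest one.
    short_len = min(len(s1), len(s2))
    distance = levenshtein(s1[:short_len], s2[:short_len])
    # Assumption: the allowed distance is the larger of the two thresholds.
    allowed = max(min_cnt, short_len * max_pct // 100)
    return distance <= allowed
```

With the defaults, a 30-character line compared against a 60-character line is accepted as a duplicate when the first 30 characters differ by at most 3 edits, which matches the worked example in the -levdistmaxpct description.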
- Added -lf : Use UNIX line terminator (LF) instead of Windows (CRLF).
- Added -noautotimeref: Prevent UTC reference from being auto set from the stream data.
- Minor GUI changes for teletext
- Added end timestamps in timed transcripts
- Added support for SMPTE (patch by John Kemp)
- Initial support for MPEG2 video tracks inside MP4 files (thanks a lot to GPAC's jean who assisted in analyzing the sample and doing the required changes in GPAC).
- Improved MP4 auto detection
- Support for PCR if PTS is not available (needed for some teletext samples, and probably useful for everything else).
- Support for UDP streaming - finally. Use "-udp $port" to have CCExtractor listen for a stream. I've only been able to test it with a European HDHomeRun, but it should work fine with any other tuner.
- Refactored PMT / PAT processing in transport streams, now allows to display their contents (-parsePAT and -parsePMT) which makes troubleshooting easier.
- Changed Window GUI size (larger).
- Added Teletext options to GUI.
- Added -teletext to force teletext mode even if not detected
- Added -noteletext to disable teletext detection. This can be needed for streams that have both 608 data and teletext packets if you need to process the 608 data (if teletext is detected it will take precedence otherwise).
- Added -datapid to force a specific elementary stream to be used for data (bypassing detections).
- Added -ru2 and -ru3 to limit the number of visible lines in roll-up captions (bypassing whatever the broadcast says).
- Added support for a .hex (hexadecimal) dump of data.
- Added support for wtv in Windows. This is done by using a new program (wtvccdump.exe) and a new DirectShow filter (CCExtractorDump.dll) that process the .wtv using DirectShow's filters and export the line 21 data to a .hex file. The GUI calls wtvccdump.exe as needed.
- Added --nogoptime to force PTS timing even when CCExtractor would use GOP timing otherwise.
- Teletext support added, by integrating Petr Kutalek's telxcc. Integration is still quite basic (there's equivalent code from both CCExtractor and telxcc) and some clean up is needed, but it works. Petr has announced that he's abandoning telxcc so further development will happen directly in CCExtractor.
- Some bug fixes, as usual.
- Corrected Mac build "script" (needed to add GPAC included). Thanks to the Mac users that sent this.
- Hauppauge mode now uses PES timing, needed for files that don't have caption data during all the video (such as in commercial breaks).
- Added -mp4 and -in:mp4 to force the input to be processed as MP4.
- CC608 data embedded in a separate stream (as opposed to in the video stream itself) in MP4 files is now supported (not heavily tested). This should be rather useful since closed captioned files from iTunes use this format.
- More CEA-708 work. The debugger is now able to dump the "TV" contents for the first time. Also, a .srt can be generated, however timing is not quite good yet (still need to figure out why).
- Added -svc (or --service) to select the CEA-708 services to be processed. For example, -svc 1,2 will process the primary and secondary language services. Valid values are 1-63, where 1 is the primary language, 2 is the secondary language (this is part of the specification) and 3-63 are provider defined.
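A small Python sketch of validating a service list such as the `1,2` example above against the documented 1-63 range. The function name `parse_services` is hypothetical, and raising an error for out-of-range numbers is an assumption made for the example rather than a description of CCExtractor's behaviour.

```python
def parse_services(arg):
    """Turn a comma-separated -svc value into a list of service numbers."""
    services = []
    for item in arg.split(","):
        number = int(item)
        # 1 is the primary language, 2 the secondary language and
        # 3-63 are provider defined.
        if not 1 <= number <= 63:
            raise ValueError("service numbers must be between 1 and 63")
        services.append(number)
    return services
```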
- Rajesh Hingorani sent a fix for the MPEG decoder that fixes garbled output for certain samples (we had none like this in our test collection). Thanks, Rajesh.
- Add: MP4 support, using GPAC (a media library).
- Fix: The Windows version was writing text files with double \r.
- Fix: Closed captions blocks with no data could cause a crash.
- Fix: -noru (to generate files without duplicate lines in roll-up) was broken, with complete lines being missing.
- Fix: bin format not working as input.
- More AVC/H.264 work. pic_order_cnt_type != 0 will be processed now.
- Fix: Roll-up captions with interruptions for Text (with ResumeTextDisplay in the middle of the caption data) were missing complete lines.
- Added a timed text transcript output format, probably only useful for roll-up captions. Use --timedtranscript or -ttxt. Output is like this:
00:01:25,485 | HOST: LAST NIGHT THE REPUBLICAN
00:01:29,522 | HOPEFULS INTRODUCE THEMSELVES TO
00:01:30,623 | PRIMARY VOTERS.
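For readers who want to post-process this output, here is a minimal Python sketch that splits one line of the layout shown above into a time offset and the caption text. It assumes exactly the `HH:MM:SS,mmm | text` layout of this sample. The name `parse_ttxt_line` is hypothetical, and other releases may emit a different layout.

```python
import re

# Matches the sample layout only: "HH:MM:SS,mmm | text".
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) \| (.*)$")


def parse_ttxt_line(line):
    """Return (milliseconds, text) for one timed transcript line."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("line does not match the expected layout")
    hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
    total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
    return total_ms, match.group(5)
```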
- XDS parser. Not complete (no point in dealing with V-Chip stuff for example), but enough to extract program and station information.
- Input streams can now come from standard input using - (just a hyphen) as parameter.
- Added a new output format called 'null' (use -null or -out=null). This format means "Don't produce any file", and is useful to have CCExtractor process the stream (for XDS messages, debugging, etc) without actually generating anything.
- Updated Windows GUI.
- Added -quiet => If used, CCExtractor will not write any message.
- Added -stdout => If used, the captions will be sent to stdout (console) instead of file. Combined with -, CCExtractor can work as a filter in a larger process, receiving the stream from stdin and sending the captions to stdout.
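The filter arrangement can be sketched from Python with the standard `subprocess` module. The `-` and `-stdout` parameters come from the entries above. The executable name `ccextractor` being on the PATH and the helper names are assumptions for the example.

```python
import subprocess


def build_filter_command(executable="ccextractor"):
    """Command line that reads the stream from stdin and writes captions
    to stdout. The executable name is an assumption."""
    return [executable, "-", "-stdout"]


def extract_captions(stream_bytes, executable="ccextractor"):
    """Pipe a stream through CCExtractor and return what it wrote to stdout."""
    result = subprocess.run(
        build_filter_command(executable),
        input=stream_bytes,       # fed to the process on stdin
        stdout=subprocess.PIPE,   # captions are collected from stdout
        check=False,              # the caller decides how to treat exit codes
    )
    return result.stdout
```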
- Some code clean up, minor refactoring.
- Teletext detection (not yet processing).
- Implemented new PTS based mode to order the caption information of AVC/H.264 data streams. The old pic_order_cnt_lsb based method is still available via the -poc or --usepicorder command switches.
- Removed a couple of those annoying "Impossible!" error messages that appear when processing some (possibly broken, unsure) files.
- Added -nots --notypesettings to prevent italics and underline codes from being displayed.
- Note to those not liking the paragraph symbol being used for the music note: Submit a VALID replacement in latin-1.
- Added preliminary support for multiple program TS files. The parameter --program-number (or -pn) will let you choose which program number to process. If no number is passed and the TS file contains more than one, CCExtractor will display a list of found programs and terminate.
- Added support (basic, because I only received one sample) for some Hauppauge cards that save CC data in their own format. Use the parameter -haup to enable it (CCExtractor will display a notice if it thinks that it's processing a Hauppauge capture anyway).
- Fixed bug in roll-up.
- More AVC work, now TS files from echostar that provided garbled output are processed OK.
- Updated Windows GUI.
- Bugfixes in the Windows version. Some debug code was unintentionally left in the released version.
- H264 support
- Other minor changes a lot less important
- Replace pattern matching code with improved parser for MPEG-2 elementary streams.
- Fix parsing of ReplayTV 5000 captions.
- Add ability to decode SCTE 20 encoded captions.
- Make decoding of TS files more error tolerant.
- Start implementation of EIA-708 decoding (not active yet).
- Add -gt / --goptime switch to use GOP timing instead of PTS timing.
- Start implementation of AVC/H.264 decoding (not active yet).
- Fixed: The basic problem is that when 24fps movie film gets converted to 30fps NTSC, every 4th frame is repeated. Some pics have 3 fields of CC data, where the field 3 CC data belongs to the same channel as field 1. The following pics have the fields reversed because of the odd number of fields. I used top_field_first to tell when the channels are reversed. See Table 6-1 of the SCTE 20 [Paul Fernquist]
- Add -nosync and -fullbin switches for debugging purposes.
- Remove -lg (--largegops) switch.
- Improve synchronization of captions for source files with jumps in their time information or gaps in the caption information.
- [R. Abarca] Changed Mac script, it now compiles/link everything from the /src directory.
- It's now possible to have CCExtractor add credits automatically.
- Added a feature to add start and end messages (for credits). See help screen for details.
- Force generated RCWT files to have the same length as source file.
- Fix documentation for -startat / -endat switches.
- Make -startat / -endat work with all output formats.
- Fix sync check for raw/rcwt files.
- Improve timing of dvr-ms NTSC captions.
- Add -in=bin switch to read CCExtractor's own binary format.
- Fix problem with short input files (smaller than 1 MB).
- Clean up regular and debug output.
- Add -out=bin switch to write RCWT data.
- Remove -bo/--bufferoutput switch and functionality.
- [Volker] Added new generic binary format (RCWT for Raw Captions With Time). This new format allows one file to contain all the available closed caption data instead of just one stream.
- Added --no_progress_bar to disable status information (mostly used when debugging, as the progress information is annoying in the middle of debug logs).
- The Windows GUI was reported to freeze in some conditions. Fixed.
- The Windows GUI now targets .NET 2.0 instead of 3.5. This allows Windows 2000 to run it (there is no .NET 3.5 for Windows 2000), as requested by a couple of key users.
- Removed -autopad and -goppad, no longer needed.
- In preparation for a new binary format we have renamed the current .bin to .raw. Raw files have only CC data (with no header, timing, etc).
- The input file format (when forced) is now specified with -in=format such as -in=ts, -in=raw, -in=ps ... The old switches (-ts, -ps, etc) still work. The only exception is -bin which has been removed (reserved for the new binary format). Use -in=raw to process a raw file.
- Removed -d, which, when producing a raw file, used a DVD format. This has been merged into a new output type "dvdraw". So now instead of using -raw -d as before, use -out=dvdraw if you need this.
- Removed --noff
- Added gui_mode_reports for frontend communications, see related file.
- Windows GUI rewritten. Source code now included, too.
- [Volker] Dish Network clean-up
- [Volker] Fix in DVR-MS NTSC timing
- [Volker] More clean-up
- Minor fixes
- [Volker] Major MPEG parser rework. Code much cleaner now.
- Some stations transmit broken roll-up captions, and for some reason don't send CRs but RUs... Added work-around code to make captions readable.
- Started work on EIA-708 (DTV). Right now you can add -debug-708 to get a dump of the 708 data. An actually useful decoder will come soon.
- Some of the changes MIGHT HAVE BROKEN MythTV's code. I don't use MythTV myself so I rely on other people's samples and reports. If MythTV is broken please let me know.
- Added new debug options.
- [Volker] Added support for DVR-MS NTSC files.
- Other minor bugfixes and changes.
- Added support for live streaming, ccextractor can now process files that are being recorded at the same time.
- [Volker] Added a new DVR-MS loop - this is completely new, DVR-MS specific code, so we no longer use the generic MPEG code for DVR-MS. DVR-MS should (or will be eventually at least) be as reliable as TS. Note: For now, it's only ATSC recordings, not NTSC (analog) recordings.
- Added autodetection of DVR-MS files.
- Added -asf to force DVR-MS mode.
- Added some specific support for DVR-MS files. This format used to work correctly in 0.34 (pure luck) but the MPEG code rework broke it. It should work as it used to.
- Updated Windows GUI to support the new options.
- Added -lg / --largegops. From the help screen: Each Group-of-Picture comes with timing information. When this info is too separate (for example because there are a lot of frames in a GOP) ccextractor may prefer not to use GOP timing. Use this option if you need ccextractor to use GOP timing in large GOPs.
- Added an option to the GUI to process individual files in batch, i.e. call ccextractor once per file. Use it if you want to process several unrelated files in one go.
- Added an option to prevent duplicate lines in roll-up captions.
- Several minor bugfixes.
- Updated the GUI to add the new options.
- Fixed a bug in the read loop (no less) that caused some files to fail when reading without buffering (which is the default in the linux build).
- Several improvements in the GUI, such as saving current options as default.
- The option switch "-transcript" has been changed to "--transcript". Also, "-txt" has been added as the short alias.
- Windows GUI
- Updated help screen
- Default output is now .srt instead of .bin, use -raw if you need the data dump instead of .srt.
- Added -trim, which removes blank spaces at the left and right of each line in .srt. Note that those spaces are there to help deaf people know if the person talking is at the left or the right of the screen, i.e. they aren't useless. But if they annoy you go ahead...
- Fixed a bug in the sanity check function that caused the Myth branch to abort.
- Fixed the OSX build script, it needed a new #define to work.
- Added -transcript. If used, the output will have no time information. Also, if in roll-up mode there will be no repeated lines.
- Lots of changes in the MPEG parser, most of them submitted by Volker Quetschke.
- Fixed a bug in the CC decoder that could cause the first line not to be cleared in roll-up mode.
- ccextractor can now follow number sequences in file names, by suffixing the name with +. For example, DVD0001.VOB+
means DVD0001.VOB, DVD0002.VOB, etc. This works for all files, so part001.ts+ does what you would expect.
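
A minimal sketch of how such a numbered sequence could be followed, written in Python for illustration only (CCExtractor itself is C). The function name next_in_sequence is hypothetical, and the rule used here (increment the last run of digits before the extension, keeping its zero padding) is an assumption, not taken from the CCExtractor source:

```python
import os
import re


def next_in_sequence(name):
    """Return the next file name in a numbered sequence, or None.

    Assumption: the counter is the last run of digits before the
    extension, and its zero padding is preserved.
    """
    stem, ext = os.path.splitext(name)
    match = re.search(r"(\d+)(\D*)$", stem)
    if match is None:
        return None  # no digits in the name, so there is no sequence
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))
    return stem[:match.start(1)] + bumped + match.group(2) + ext


print(next_in_sequence("DVD0001.VOB"))
print(next_in_sequence("part001.ts"))
```

A caller would presumably keep asking for the next name until the file no longer exists on disk.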
- Added -90090 which changes the clock frequency from the MPEG standard 90000 to 90090. It *could* (remains to be seen) help if there are timing issues.
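
To give an idea of the size of the effect, here is a small arithmetic sketch in Python. The helper ticks_to_ms is hypothetical; it only assumes that caption times are derived by dividing clock ticks by the clock frequency, which may not match how CCExtractor applies the option internally:

```python
def ticks_to_ms(ticks, clock_hz=90000):
    """Convert MPEG clock ticks to whole milliseconds."""
    return ticks * 1000 // clock_hz


one_hour_ticks = 3600 * 90000  # one hour measured with the 90000 Hz clock

standard = ticks_to_ms(one_hour_ticks)         # interpreted at 90000 Hz
adjusted = ticks_to_ms(one_hour_ticks, 90090)  # interpreted at 90090 Hz

# Difference in milliseconds accumulated over one hour of video
print(standard - adjusted)
```

Under this assumption the two interpretations drift apart by roughly 3.6 seconds per hour, which is the kind of slow desync the option is meant to help with.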
- Better support for Tivo files.
- By default ccextractor now considers the whole input file list as one large file, instead of several independent video files. This has been changed because most programs (for example DVDDecrypt) just cut the files by size. If you need the old behaviour (because you actually edited the video files and want to join the subs), use -ve.
- Fixed bug in SMI, nbsp was missing a ;.
- Footer for SAMI files was incorrect (<body> and <sami> tags were being opened again instead of being closed).
- Displayed memory is now written to disk at end of stream even if there is no command requesting so (may prevent losing the last screenful).
- Important change that could break scripts, but that has been made because the old behaviour was annoying to most people: _1 and _2 at the end of the output file names are now added ONLY if -12 is used (i.e. when there are two output files to produce). So
ccextractor -srt sopranos.mpg
now produces sopranos.srt instead of sopranos_1.srt. If you use -12, i.e.
ccextractor -srt -12 sopranos.mpg
you get sopranos_1.srt and sopranos_2.srt.
- Added --defaultcolor to the help screen. Code was already in 0.34 but the documentation wasn't updated.
- Buffer is larger now, since I've found a sample where 256 Kb isn't enough for a PES (go figure).
- At the end of the process, a ratio between video length and time to process is displayed.
- Added some basic letter case and capitalization support. For captions that broadcast in ALL UPPERCASE (most of them), ccextractor can now do the first part of the job.
--sentencecap or -sc will tell ccextractor to follow the typical capitalization rules, such as capitalizing months, days of the week, etc.
So from YOU BETTER RESPECT THIS ROBE, ALAN you get
You better respect this robe, alan.
--capfile or -caf also enables the case processing part and adds an extra list of words in the specified file, for example names.txt,
where names.txt is just a plain text file with the proper spelling for some words, such as Alan and Tony.
So you get
You better respect this robe, Alan.
Which is the correct spelling. You can have a different spelling file per TV show, or a large file with a lot of words, etc.
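
The two steps described above can be sketched roughly as follows. This is an illustration in Python, not CCExtractor's actual implementation: the function name sentence_case is hypothetical, and the built-in rules for months and days of the week are left out.

```python
import re


def sentence_case(line, proper_words=()):
    """Lower-case an all-caps caption, then restore capitals.

    proper_words plays the role of the spelling file: each entry is
    the correct spelling of a word, e.g. "Alan".
    """
    text = line.lower()
    # Capitalize the first letter of the line and of each new sentence.
    text = re.sub(r"(^|[.?!]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    # Apply the proper spelling of every listed word.
    for word in proper_words:
        text = re.sub(r"\b" + re.escape(word.lower()) + r"\b", word, text)
    return text


print(sentence_case("YOU BETTER RESPECT THIS ROBE, ALAN."))
print(sentence_case("YOU BETTER RESPECT THIS ROBE, ALAN.", ["Alan", "Tony"]))
```

The word list here is passed in directly; reading it from a file with one word per line would be a natural extension.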
- ccextractor has been reported to compile and run on Mac with a minor change in the build script, so I've created a mac directory with the modified script. I haven't tested it myself.
- Windows build comes with a File Version Number (0.0.0.34 in this version) in case you want to check for version info.
- Added -scr or --screenfuls, to select the number of screenfuls ccextractor should write before exiting. A screenful is a change of screen contents caused by a CC command (not new characters). In practice, this means that for .srt each group of lines is a screenful, except when using -dru (which produces a lot of groups of lines because each new character produces a new group).
- Completed tables for all encodings.
- Fixed bug in .srt related to milliseconds in time lines.
- Font colors are back for .srt (apparently some programs do support them after all). Use -nofc or --nofontcolor if you don't want these tags.
- Added -delay ms, which adds (or subtracts) a number of milliseconds to all times in .srt/.sami files. For example, -delay 400
causes all subtitles to appear 400 ms later than they normally would, and -delay -400
causes all subtitles to appear 400 ms earlier than they normally would.
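
As an illustration of what the delay does to an individual .srt time, here is a short Python sketch. The helper names are hypothetical, and clamping at zero for times that would become negative is an assumption made for the example, not documented CCExtractor behaviour:

```python
def parse_srt_time(stamp):
    """Convert an .srt time such as 00:01:02,500 to milliseconds."""
    hms, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def format_srt_time(total_ms):
    """Convert milliseconds back to the .srt HH:MM:SS,mmm form."""
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def shift_srt_time(stamp, delay_ms):
    """Apply a positive (later) or negative (earlier) delay."""
    # Assumption: times are never allowed to go below zero.
    return format_srt_time(max(0, parse_srt_time(stamp) + delay_ms))


print(shift_srt_time("00:00:01,500", 400))
print(shift_srt_time("00:00:01,500", -400))
```

The same shift would be applied to both the start and the end time of every subtitle.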
- Added -startat and -endat, which let you select just a portion of data to be processed, such as from minute 3 to minute 5. Check help screen for exact syntax.
- Added -dru (direct rollup), which causes roll-up captions to be written as they would on TV instead of line by line. This makes .srt/.sami files a lot longer, and ugly too (each line is written many times, two characters at a time).
- Fix in extended char decoding, I wasn't replacing the previous char.
- When a sequence code was found before having a PTS, reported time was undefined.
- Minor bugfix.
- Fixed a buffering related issue. Short version, the first 2 Mb in non-TS mode were being discarded.
- .srt no longer has <font> tags. No player seems to process them so my guess is that they are not part of the .srt "standard" even if McPoodle adds them.
- Modified sanitizing code, it's less aggressive now. Ideally it should mean that characters won't be missed anymore. We'll see.
- Added -gp (or -goppad) to make ccextractor use GOP timing. Try it for non TS files where subs start OK but desync as the video advances.
- Format detection is not perfect yet. I've added -nomyth to prevent the MythTV code path from being called. I've seen apparently correct files that make MythTV's MPEG decoder choke. So, if it doesn't work correctly automatically, try -nomyth and -myth. Hopefully one of the two options will work.
- Fixed a bug that caused dvr-ms (Windows Media Center) files to be incorrectly processed (letters out of order all the time).
- Reworked input buffer code, faster now.
- Completed MythTV's MPEG decoder for Program Streams, which results in better processing of some specific files.
- Automatic file format detection for all kinds of files and closed caption storage methods. No need to tell ccextractor anything about your file (but you still can).
- Added text mode handling into decoder, which gets rid of junk when text mode data is present.
- Added support for certain (possibly non-standard-compliant) DVDs that add more caption blocks in a user data block than they should (such as Red October).
- Fix in roll-up init code that caused the previous popup captions not to be written to disk.
- Other Minor bug fixes.
- Unicode should be decent now.
- Added support for Hauppauge PVR 250 cards, and (possibly) many others (bttv) with the same closed caption recording format. This is the result of hacking MythTV's MPEG parser into ccextractor. Integration is not very good (to put it mildly) but it seems to work. Depending on the feedback I may continue working on this or just leave it 'as is' (good enough). If you want to process a file generated by one of these analog cards, use -myth. This is essential as it will make the program take a totally different code path.
- Added .SAMI generation. I'm sure this can be improved, though. If you have a good CSS for .SAMI files let me know.
- Work on Dish Network streams, timing was completely broken. It's fixed now at least for the samples I have, if it's not completely fixed let me know. Credit for this goes to Jack Ha who sent me a couple of samples and a first implementation of a semiworking fix.
- Added support for several input files (see help screen for details).
- Added Unicode and Latin-1 encoding.
- Extraction to .srt is almost complete - works correctly for pop-up and roll-up captions, possibly not yet for paint-on (mostly because I don't have any sample with paint-on captions so I can't test).
- Minor bug fixes.
- Automatic TS/non-TS mode detection.
- Work on handling special cases related to the MPEG reference clock: Roll over, jumps, etc.
- Modified padding code a bit: In particular, padding occurs on B-Frames now.
- Started work on CC data parsing (use -608 to see output).
- Added built-in input buffering.
- Major code reorganization.
- Added a decent progress indicator.
- Added TS header synchronization (so the input file no longer needs to start with a TS header).
- Minor bug fixes.
- Added MPEG reference clock parsing.
- Added autopadding in TS. Does miracles with timing.
- Added video information (as extracted from sequence header).
- Some code clean-up.
- FF sanity check enabled by default.