CCExtractor - closed captions extractor
ccextractor [options] inputfile1 [inputfile2...] [-o
outputfilename] [-o1 outputfilename1] [-o2 outputfilename2]
Extracts closed captions and teletext subtitles from video
streams. DVB, .TS, ReplayTV 4000 and 5000, dvr-ms, bttv, Tivo, Dish Network,
.mp4, HDHomeRun are known to work. It can do two things:
- Save the data to a "raw", unprocessed file which you
can later use
as input for other tools.
- Generate a subtitles file (.srt,.smi, or .txt) which you can
directly
use with your favourite player.
- inputfile: file(s)
to process
- -o outputfilename
- Use -o parameters to define output filename if you don't like the
default ones (same as infile plus _1 or _2 when needed and file extension,
e.g. .srt).
- -cf filename
- Write 'clean' data to a file. Cleans means the ES without TS or PES
headers.
- -stdout
- Write output to stdout (console) instead of file. If stdout is used, then
-o, -o1 and -o2 can't be used. Also -stdout
will redirect all messages to stderr (error).
- Dump the PES Header to stdout (console). This is used for debugging
purposes to see the contents of each PES packet header.
- -debugdvbsub
- Write the DVB subtitle debug traces to console.
- -ignoreptsjumps
- Ignore PTS jumps (default).
- -fixptsjumps
- fix pts jumps. Use this parameter if you experience timeline resets/jumps
in the output.
- -stdin
- Reads input from stdin (console) instead of file.
You can pass as many input files as you need. They will be
processed in order. If a file name is suffixed by +, ccextractor will try to
follow a numerical sequence. For example, DVD001.VOB+ means DVD001.VOB,
DVD002.VOB and so on until there are no more files. Output will be one
single file (either raw or srt). Use this if you made your recording in
several cuts (to skip commercials for example) but you want one subtitle
file with contiguous timing.
- -udp port
- Read the input via UDP (listening in the specified port) instead of
reading a file.
- -udp
[host:]port
- Read the input via UDP (listening in the specified port) instead of
reading a file. Host can be a hostname or IPv4 address. If host is not
specified then listens on the local host.
- -udp
[src@host:]port
- Read the input via UDP (listening in the specified port) instead of
reading a file. Host and src can be a hostname or IPv4 address. If host is
not specified then listens on the local host.
- -sendto
host[:port]
- Sends data in BIN format to the server according to the CCExtractor's
protocol over TCP. For IPv6 use [address]:port
- -tcp port
- Reads the input data in BIN format according to CCExtractor's protocol,
listening specified port on the local host
- -tcppassword
password
- Sets server password for new connections to tcp server
- -tcpdesc
description
- Sends to the server short description about captions e.g. channel name or
file name
- -1, -2, -12
- Output Field 1 data, Field 2 data, or both (DEFAULT is -1)
- --append
- To prevent overwriting of existing files. The output will be appended
instead.
- -cc2
- When in srt/sami mode, process captions in channel 2 instead of channel
1.
- -svc --service
N1[cs1],N2[cs2]...
- Enable CEA-708 (DTVCC) captions processing for the listed services. The
parameter is a comma delimited list of services numbers, such as
"1,2" to process the primary and secondary language services.
Pass "all" to process all services found.
- If captions in a service are stored in 16-bit encoding, you can specify
what charset or encoding was used. Pass its name after service number
(e.g. "1[EUC-KR],3" or "all[EUC-KR]") and it will
encode specified charset to UTF-8 using iconv. See iconv documentation to
check if required encoding/charset is supported.
- In general, if you want English subtitles you don't need to use these
options as they are broadcast in field 1, channel 1. If you want the
second language (usually Spanish) you may need to try -2, or
-cc2, or both.
- -out=format
- where format is one of
these:
- srt -> SubRip (default, so not actually needed).
- ass/ssa -> SubStation Alpha.
- webvtt -> WebVTT format
- webvtt-full -> WebVTT format with styling
- sami -> MS Synchronized Accesible Media Interface.
- bin -> CC data in CCExtractor's own binary format.
- raw -> CC data in McPoodle's Broadcast format.
- dvdraw -> CC data in McPoodle's DVD format.
- txt -> Transcript (no time codes, no roll-up captions, just the plain
transcription.
- ttxt -> Timed Transcript (transcription with time info)
- smptett -> SMPTE Timed Text (W3C TTML) format.
- spupng -> Set of .xml and .png files for use with dvdauthor's spumux.
See "Notes on spupng output format"
- null -> Don't produce any file output
- report -> Prints to stdout information about captions in specified
input. Don't produce any file output
- -gt
--goptime
- Use GOP for timing instead of PTS. This only applies to Program or
Transport Streams with MPEG2 data and overrides the default PTS timing.
GOP timing is always used for Elementary Streams.
- -nogt
--nogoptime
- Never use GOP timing (use PTS), even if ccextractor detects GOP timing is
the reasonable choice.
- -fp
--fixpadding
- Fix padding - some cards (or providers, or whatever) seem to send 0000 as
CC padding instead of 8080. If you get bad timing, this might solve
it.
- -90090
- Use 90090 (instead of 90000) as MPEG clock frequency. (reported to be
needed at least by Panasonic DMR-ES15 DVD Recorder)
- -ve
--videoedited
- By default, ccextractor will process input files in sequence as if they
were all one large file (i.e. split by a generic, non video-aware tool. If
you are processing video hat was split with a editing tool, use -ve
so ccextractor doesn't try to rebuild the original timing.
- -s --stream
[secs]
- Consider the file as a continuous stream that is growing as ccextractor
processes it, so don't try to figure out its size and don't terminate
processing when reaching the current end (i.e. wait for more data to
arrive). If the optional parameter secs is present, it means the number of
seconds without any new data after which ccextractor should exit. Use this
parameter if you want to process a live stream but not kill ccextractor
externally. Note: If -s is used then only one input file is
allowed.
- -poc
--usepicorder
- Use the pic_order_cnt_lsb in AVC/H.264 data streams to order the CC
information. The default way is to use the PTS information. Use this
switch only when needed.
- -myth
- Force MythTV code branch.
- -nomyth
- Disable MythTV code branch. The MythTV branch is needed for analog
captures where the closed caption data is stored in the VBI, such as those
with bttv cards (Hauppage 250 for example). This is detected automatically
so you don't need to worry about this unless autodetection doesn't work
for you.
- -wtvconvertfix
- This switch works around a bug in Windows 7's built in software to convert
*.wtv to *.dvr-ms. For analog NTSC recordings the CC information is marked
as digital captions. Use this switch only when needed.
- -wtvmpeg2
- Read the captions from the MPEG2 video stream rather than the captions
stream in WTV files
- -pn
--program-number
- In TS mode, specifically select a program to process. Not needed if the TS
only has one. If this parameter is not specified and CCExtractor detects
more than one program in the input, it will list the programs found and
terminate without doing anything, unless -autoprogram (see below)
is used.
- -autoprogram
- If there's more than one program in the stream, just use the first one we
find that contains a suitable stream.
- -datapid
- Don't try to find out the stream for caption/teletext data, just use this
one instead.
- -datastreamtype
- Instead of selecting the stream by its PID, select it by its type (pick
the stream that has this type in the PMT)
- -streamtype
- Assume the data is of this type, don't autodetect. This parameter may be
needed if -datapid or -datastreamtype is used and
CCExtractor cannot determine how to process the stream. The value will
usually be 2 (MPEG video) or 6 (MPEG private data).
- -haup
--hauppauge
- If the video was recorder using a Hauppauge card, it might need special
processing. This parameter will force the special treatment.
- -mp4vidtrack
- In MP4 files the closed caption data can be embedded in the video track or
in a dedicated CC track. If a dedicated track is detected it will be
processed instead of the video track. If you need to force the video track
to be processed instead use this option.
- -noautotimeref
- Some streams come with broadcast date information. When such data is
available, CCExtractor will set its time reference to the received data.
Use this parameter if you prefer your own reference. Note: Current this
only affects Teletext in timed transcript with -datets.
- --noscte20
- Ignore SCTE-20 data if present.
- --webvtt-create-css
- Create a separate file for CSS instead of inline.
- -deblev
- Enable debug so the calculated distance for each two strings is displayed.
The output includes both strings, the calculated distance, the maximum
allowed distance, and whether the strings are ultimately considered
equivalent or not, i.e. the calculated distance is less or equal than the
max allowed..
- -anvid
--analyzevideo
- Analyze the video stream even if it's not used for subtitles. This allows
one to provide video information.
- When processing teletext files CCExtractor tries to correct typos by
comparing consecutive lines. If line N+1 is almost identical to line N
except for minor changes (plus next characters) then it assumes that line
N that a typo that was corrected in N+1. This is currently implemented in
teletext because it's where samples files that could benefit from this
were available. You can adjust, or disable, the algorithm settings with
the following parameters.
- -nolevdist
- Don't attempt to correct typos with Levenshtein distance.
- -levdistmincnt
value
- Minimum distance we always allow regardless of the length of the
strings.Default 2. This means that if the calculated distance is 0,1 or 2,
we consider the strings to be equivalent.
- -levdistmaxpct
value
- Maximum distance we allow, as a percentage of the shortest string length.
Default 10%.0 For example, consider a comparison of one string of 30
characters and one of 60 characters. We want to determine whether the
first 30 characters of the longer string are more or less the same as the
shortest string, i.e. whether the longest string is the shortest one plus
new characters and maybe some corrections. Since the shortest string is 30
characters and the default percentage is 10%, we would allow a distance of
up to 3 between the first 30 characters.
- -chapters
(Experimental)
- Produces a chapter file from MP4 files. Note that this must only be used
with MP4 files, for other files it will simply generate subtitles
file.
- -bom
- Append a BOM (Byte Order Mark) to output files. Note that most text
processing tools in linux will not like BOM. This is the default in
Windows builds.
- -nobom
- Do not append a BOM (Byte Order Mark) to output files. Note that this may
break files when using Windows. This is the default in non-Windows
builds.
- -unicode
- Encode subtitles in Unicode instead of Latin-1.
- -utf8
- Encode subtitles in UTF-8 (no longer needed. because UTF-8 is now the
default).
- -latin1
- Encode subtitles in Latin-1
- -nofc
--nofontcolor
- For .srt/.sami/.vtt, don't add font color tags.
- --nohtmlescape
- For .srt/.sami/.vtt, don't covert html unsafe character
- -nots
--notypesetting
- For .srt/.sami/.vtt, don't add typesetting tags.
- -trim
- Trim lines.
- -dc
--defaultcolor
- Select a different default color (instead of white). This causes all
output in .srt/.smi/.vtt files to have a font tag, which makes the files
larger. Add the color you want in RGB, such as -dc #FF0000 for
red.
- -sc
--sentencecap
- Sentence capitalization. Use if you hate ALL CAPS in subtitles.
- -sbs
--splitbysentence
- Split output text so each frame contains a complete sentence. Timings are
adjusted based on number of characters.
- --capfile
-caf file
- Add the contents of 'file' to the list of words that must be capitalized.
For example, if file is a plain text file that contains
- Tony
Alan
- Whenever those words are found they will be written exactly as they appear
in the file. Use one line per word. Lines starting with # are considered
comments and discarded.
- -unixts
REF
- For timed transcripts that have an absolute date instead of a timestamp
relative to the file start), use this time reference (UNIX timestamp). 0
=> Use current system time. ccextractor will automatically switch to
transport stream UTC timestamps when available.
- -datets
- In transcripts, write time as YYYYMMDDHHMMss,ms.
- -sects
- In transcripts, write time as ss,ms
- -UCLA
- Transcripts are generated with a specific format that is convenient for a
specific project, feel free to play with it but be aware that this format
is really live - don't rely on its output format not changing between
versions.
- -lf
- Use LF (UNIX) instead of CRLF (DOS, Windows) as line terminator.
- -autodash
- Based on position on screen, attempt to determine the different speakers
and a dash (-) when each of them talks (.srt/.vtt only, -trim
required).
- -xmltv mode
- produce an XMLTV file containing the EPG data from the source TS file.
Mode: 1 = full output 2 = live output. 3 = both
- -sem
- Create a .sem file for each output file that is open and delete it on file
close.
- -dvbcolor
- For DVB subtitles, also output the color of the subtitles, if the output
format is SRT or WebVTT.
- -nodvbcolor
- In DVB subtitles, disable color in output.
- -dvblang
- For DVB subtitles, select which language's caption stream will be
processed. e.g. 'eng' for English. If there are multiple languages, only
this specified language stream will be processed (default).
- -ocrlang
- Manually select the name of the Tesseract .traineddata file. Helpful if
you want to OCR a caption stream of one language with the data of another
language. e.g. '-dvblang chs -ocrlang chi_tra' will decode the
Chinese (Simplified) caption stream but perform OCR using the Chinese
(Traditional) trained data This option is also helpful when the
traineddata file has non standard names that don't follow ISO specs
- -oem
- Select the OEM mode for Tesseract, could be 0, 1 or 2. 0:
OEM_TESSERACT_ONLY - default value, the fastest mode. 1: OEM_LSTM_ONLY -
use LSTM algorithm for recognition. 2: OEM_TESSERACT_LSTM_COMBINED - both
algorithms.
- -mkvlang
- For MKV subtitles, select which language's caption stream will be
processed. e.g. 'eng' for English. Language codes can be either the 3
letters bibliographic ISO-639-2 form (like "fre" for french) or
a language code followed by a dash and a country code for specialities in
languages (like "fre-ca" for Canadian French).
- -nospupngocr
- When processing DVB don't use the OCR to write the text as comments in the
XML file.
- -font
- Specify the full path of the font that is to be used when generating
SPUPNG files.
- -bi
--bufferinput
- Forces input buffering.
- -nobi
-nobufferinput
- Disables input buffering.
- -bs --buffersize
val
- Specify a size for reading, in bytes (suffix with K or or M for kilobytes
and megabytes). Default is 16M.
- -koc
- keep-output-close. If used then CCExtractor will close the output file
after writing each subtitle frame and attempt to create it again when
needed.
- -ff
--forceflush
- Flush the file buffer whenever content is written.
- -dru
- Direct Roll-Up. When in roll-up mode, write character by character instead
of line by line. Note that this produces (much) larger files.
- -noru
--norollup
- If you hate the repeated lines caused by the roll-up emulation, you can
have ccextractor write only one line at a time, getting rid of these
repeated lines.
- -ru1 / ru2 / ru3
- roll-up captions can consist of 2, 3 or 4 visible lines at any time (the
number of lines is part of the transmission). If having 3 or 4 lines
annoys you you can use -ru to force the decoder to always use 1, 2
or 3 lines. Note that 1 line is not a real mode rollup mode, so
CCExtractor does what it can. In -ru1 the start timestamp is
actually the timestamp of the first character received which is possibly
more accurate.
- -delay ms
- For srt/sami/webvtt, add this number of milliseconds to all times. For
example, -delay 400 makes subtitles appear 400ms late. You can also
use negative numbers to make subs appear early.
Notes on times: -startat and -endat times are used
first, then -delay. So if you use -srt -startat 3:00
-endat 5:00 -delay 120000, ccextractor will generate a .srt
file, with only data from 3:00 to 5:00 in the input file(s) and then add
that (huge) delay, which would make the final file start at 5:00 and end at
7:00.
- -startat
time
- Only write caption information that starts after the given time. Time can
be seconds, MM:SS or HH:MM:SS. For example, -startat 3:00 means
'start writing from minute 3.
- -endat time
- Stop processing after the given time (same format as -startat). The
-startat and -endat options are honored in all output
formats. In all formats with timing information the times are
unchanged.
- -scr --screenfuls
num
- Write 'num' screenfuls and terminate processing.
If codec type is not selected then first elementary stream
suitable for subtitle is selected, please consider -teletext
-noteletext override this option.
- -codec
dvbsub
- select the dvb subtitle from all elementary stream, if stream of dvb
subtitle type is not found then nothing is selected and no subtitle is
generated
- -nocodec
dvbsub
- ignore dvb subtitle and follow default behaviour
- -codec
teletext
- select the teletext subtitle from elementary stream
-nocodec teletext
- ignore teletext subtitle
- NOTE: option given in form -foo=bar ,-foo = bar and
--foo=bar are invalid. Valid option are only in form
-foo bar. nocodec and codec parameter must not be same. If found to
be same then parameter of nocodec is ignored, this flag should be passed
once, more then one are not supported yet and last parameter would taken
in consideration
CCExtractor can _try_ to add a custom message (for credits for
example) at the start and end of the file, looking for a window where there
are no captions. If there is no such window, then no text will be added. The
start window must be between the times given and must have enough time to
display the message for at least the specified time.
- --startcreditstext
txt
- Write this text as start credits. If there are several lines, separate
them with the characters \n, for example Line1\nLine 2.
- --startcreditsnotbefore
time
- Don't display the start credits before this time (S, or MM:SS). Default:
0
- --startcreditsnotafter
time
- Don't display the start credits after this time (S, or MM:SS). Default:
5:00
- --startcreditsforatleast
time
- Start credits need to be displayed for at least this time (S, or MM:SS).
Default: 2
- --startcreditsforatmost
time
- Start credits should be displayed for at most this time (S, or MM:SS).
Default: 5
- --endcreditstext
txt
- Write this text as end credits. If there are several lines, separate them
with the characters \n, for example Line1\nLine 2.
- --endcreditsforatleast
time
- End credits need to be displayed for at least this time (S, or MM:SS).
Default: 2
- --endcreditsforatmost
time
- End credits should be displayed for at most this time (S, or MM:SS).
Default: 5
- -debug
- Show lots of debugging output.
- -608
- Print debug traces from the EIA-608 decoder. If you need to submit a bug
report, please send the output from this option.
- -708
- Print debug information from the (currently in development) EIA-708 (DTV)
decoder.
- -goppts
- Enable lots of time stamp output.
- -xdsdebug
- Enable XDS debug data (lots of it).
- -vides
- Print debug info about the analysed elementary video stream.
- -cbraw
- Print debug trace with the raw 608/708 data with time stamps.
- -nosync
- Disable the syncing code. Only useful for debugging purposes.
- -fullbin
- Disable the removal of trailing padding blocks when exporting to bin
format. Only useful for for debugging purposes.
- -parsedebug
- Print debug info about the parsed container file. (Only for TS/ASF files
at the moment.)
- -parsePAT
- Print Program Association Table dump.
- -parsePMT
- Print Program Map Table dump.
- -dumpdef
- Hex-dump defective TS packets.
- -investigate_packets
- If no CC packets are detected based on the PMT, try to find data in all
packets by scanning.
- -tpage page
- Use this page for subtitles (if this parameter is not used, try to
autodetect). In Spain the page is always 888, may vary in other
countries.
- -tverbose
- Enable verbose mode in the teletext decoder.
- -teletext
- Force teletext mode even if teletext is not detected. If used, you should
also pass -datapid to specify the stream ID you want to
process.
- -noteletext
- Disable teletext processing. This might be needed for video streams that
have both teletext packets and CEA-608/708 packets (if teletext is
processed then CEA-608/708 processing is disabled).
- -customtxt
format
- Use the passed format to customize the (Timed) Transcript output. The
format must be like this: 1100100 (7 digits). These indicate whether the
next things should be displayed or not in the (timed) transcript. They
represent (in order):
- — Display start time
- — Display end time
- — Display caption mode
- — Display caption channel
- — Use a relative timestamp ( relative to the sample)
- — Display XDS info
- — Use colors
- Examples:
- 0000101 is the default setting for transcripts
1110101 is the default for timed transcripts
1111001 is the default setting for -ucla
- Make sure you use this parameter after others that might affect these
settings (-out, -ucla, -xds, -txt,
-ttxt ...)
- --gui_mode_reports
- Report progress and interesting events to stderr in a easy to parse
format. This is intended to be used by other programs. See docs directory
for details.
- --no_progress_bar
- Suppress the output of the progress bar
- -quiet
- Don't write any message.
Notes on the CEA-708 decoder: While it is starting to be useful,
it's a work in progress. A number of things don't work yet in the decoder
itself, and many of the auxiliary tools (case conversion to name one) won't
do anything yet. Feel free to submit samples that cause problems and feature
requests.
Notes on spupng output format: One .xml file is created per output
field. A set of .png files are created in a directory with the same base
name as the corresponding .xml file(s), but with a .d extension. Each .png
file will contain an image representing one caption and named subNNNN.png,
starting with sub0000.png.
For example, the command:
- ccextractor -out=spupng input.mpg
will create the files:
- input.xml
input.d/sub0000.png
input.d/sub0001.png
...
The command:
- ccextractor -out=spupng -o /tmp/output -12 input.mpg
will create the files:
- /tmp/output_1.xml
/tmp/output_1.d/sub0000.png
/tmp/output_1.d/sub0001.png
...
/tmp/output_2.xml
/tmp/output_2.d/sub0000.png
/tmp/output_2.d/sub0001.png
...
- -hardsubx
- Enable the burned-in subtitle extraction subsystem.
NOTE: The following options will work only if -hardsubx is specified
before them:-
- -ocr_mode
- Set the OCR mode to either frame-wise, word-wise or letter wise.
e.g. -ocr_mode frame (default), -ocr_mode word,
-ocr_mode letter
- -subcolor
- Specify the color of the subtitles
Possible values are in the set {white,yellow,green,cyan,blue,magenta,red}.
Alternatively, a custom hue value between 1 and 360 may also be specified.
e.g. -subcolor white or -subcolor 270 (for violet).
Refer to an HSV color chart for values.
- -min_sub_duration
- Specify the minimum duration that a subtitle line must exist on the
screen.
The value is specified in seconds.
A lower value gives better results, but takes more processing time.
The recommended value is 0.5 (default).
e.g. -min_sub_duration 1.0 (for a duration of 1 second)
- -detect_italics
- Specify whether italics are to be detected from the OCR text.
Italic detection automatically enforces the OCR mode to be word-wise
- -conf_thresh
- Specify the classifier confidence threshold between 1 and 100.
Try and use a threshold which works for you if you get a lot of garbage
text.
e.g. -conf_thresh 50
- -whiteness_thresh
- For white subtitles only, specify the luminance threshold between 1 and
100.
This threshold is content dependent, and adjusting values may give you
better results
Recommended values are in the range 80 to 100.
The default value is 95
- An example command for burned-in subtitle extraction is as follows:
- ccextractor video.mp4 -hardsubx -subcolor white -detect_italics
-whiteness_thresh 90 -conf_thresh 60
- --version
- Display current CCExtractor version and detailed information.