DOC2TXT(1) DOC2TXT(1)
NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings,
msexceltables - extract printable text from Microsoft
documents
SYNOPSIS
doc2txt [ file.doc ]
doc2ps [ file.doc ]
wdoc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings mtpt/WordDocument
aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ]
[ -w worksheet-range ] mtpt/Workbook
DESCRIPTION
Doc2txt is an rc(1) script that uses olefs and mswordstrings
to extract the printable text from the body of a Microsoft
Word document and write it on the standard output. Doc2ps
is similar, but emits PostScript corresponding to the document.
Wdoc2txt is similar to doc2txt, but uses plumb(1) to
send the output to a new acme(1) window instead. Xls2txt
performs a similar function for Microsoft Excel documents.
Microsoft Office documents are stored in OLE (Object Linking
and Embedding) format, which is a scaled-down version of
Microsoft's FAT file system. Olefs presents the contents of
an MS Office document as a file system on mtpt, which
defaults to /mnt/doc. Mswordstrings or msexceltables may
then be used to parse the files inside, extracting a text
stream. Msexceltables may be given options to control the
formatting of its output.
-a Attempt conversion of non-tabular sheets in the
workbook (charts).
-d delim Sets the inter-field delimiter to the string
delim, by default a single space.
-D Enables debugging output.
-c range Range is a comma-separated list of column numbers
and ranges; a range is two numbers separated by a
dash. Processing is limited to the columns named;
by default all columns are output.
-n Disables field padding to column width.
-q Disables quoting of textual fields (see quote(2)).
-t Truncate fields to the column width.
-w range Range is a comma-separated list of worksheet numbers
and ranges, in the same syntax as the -c option
above; it limits the sheets output. Suppressed
chart pages are always included in the sheet count.
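The range syntax taken by -c and -w can be illustrated with a short Python sketch that expands a range string into the individual numbers it names. This is not the parser used by msexceltables: the function name expand_ranges is hypothetical, and the sketch assumes that ranges are inclusive at both ends and that every number is a non-negative integer.

```python
def expand_ranges(spec):
    """Expand a range string such as '1,7,9-14' into a sorted list of ints."""
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            lo, hi = int(first), int(last)
            if lo > hi:
                raise ValueError("descending range: " + part)
            # Assumption: both ends of a range are included.
            numbers.update(range(lo, hi + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


print(expand_ranges("1,7,9-14"))  # [1, 7, 9, 10, 11, 12, 13, 14]
print(expand_ranges("3-5"))       # [3, 4, 5]
```

Under these assumptions, -w 1,7,9-14 names eight worksheets and -c 3-5 names three columns.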
Page 1 Plan 9 (printed 10/24/25)
EXAMPLE
Extract pieces of an MS Excel spreadsheet.
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
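A file such as rpt.txt, written with -q, -n and -d '@', may then be processed elsewhere. The Python sketch below splits each line into fields on the delimiter. It rests on assumptions not stated in this page: that msexceltables writes one row per line, that the text is UTF-8, and that no field contains the delimiter itself. The function name read_fields is hypothetical.

```python
def read_fields(path, delim="@"):
    """Return the non-empty lines of path, each split into fields on delim."""
    rows = []
    # Assumption: the output is UTF-8 text, one spreadsheet row per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                rows.append(line.split(delim))
    return rows
```

Calling read_fields("rpt.txt") returns a list of rows, each a list of field strings. Because -q disables quoting, a field that contains the delimiter cannot be told apart from two fields, so a delimiter should be chosen that does not occur in the data.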
SOURCE
/rc/bin doc2txt, doc2ps, wdoc2txt, and xls2txt
/sys/src/cmd/aux the others
SEE ALSO
strings(1)
``Microsoft Word 97 Binary File Format'', at Microsoft's
developer (MSDN) home page.
``LAOLA Binary Structures'',
http://user.cs.tu-berlin.de/~schwartz/pmh
``OpenOffice.Org's Excel Documentation'',
http://sc.openoffice.org/excelfileformat.pdf