DOC2TXT(1)                                                  DOC2TXT(1)

NAME
     doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings,
     msexceltables - extract printable text from Microsoft
     documents

SYNOPSIS
     doc2txt [ file.doc ]
     doc2ps [ file.doc ]
     wdoc2txt [ file.doc ]
     xls2txt [ file.xls ]
     aux/olefs [ -m mtpt ] file.doc
     aux/mswordstrings mtpt/WordDocument
     aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ]
          [ -w worksheet-range ] mtpt/Workbook

DESCRIPTION
     Doc2txt is an rc(1) script that uses olefs and mswordstrings
     to extract the printable text from the body of a Microsoft
     Word document and write it on the standard output. Doc2ps
     is similar, but emits PostScript corresponding to the
     document. Wdoc2txt is similar to doc2txt, but uses plumb(1)
     to send the output to a new acme(1) window instead. Xls2txt
     performs a similar function for Microsoft Excel documents.

     Microsoft Office documents are stored in OLE (Object Linking
     and Embedding) format, which is a scaled down version of
     Microsoft's FAT file system. Olefs presents the contents of
     an MS Office document as a file system on mtpt, which
     defaults to /mnt/doc. Mswordstrings or msexceltables may
     then be used to parse the files inside, extracting a text
     stream.

     Msexceltables may be given options to control the formatting
     of its output.

     -a         Attempt conversion of non-tabular sheets in the
                workbook (charts).

     -d delim   Set the inter-field delimiter to the string
                delim, by default a single space.

     -D         Enable debugging output.

     -c range   Range is a comma-separated list of column numbers
                and ranges. Ranges are separated by dashes.
                Limit processing to just those columns named; by
                default all columns are output.

     -n         Disable field padding to column width.

     -q         Disable quoting of textual fields (see quote(2)).

     -t         Truncate fields to the column width.

     -w range   Range is a comma-separated list of worksheet
                numbers and ranges; it limits the sheets output,
                using the same syntax as the -c option above.
                Suppressed chart pages are always included in the
                sheet count.
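
     The range syntax accepted by -c and -w can be illustrated
     with a short Python sketch. It is not taken from the
     msexceltables source: the function name parse_range is
     hypothetical, and the sketch assumes that a dashed range
     includes both endpoints and that duplicates collapse. How
     msexceltables itself treats malformed or reversed ranges is
     not documented here.

```python
def parse_range(spec):
    """Expand a range list such as '1,7,9-14' into a sorted list of ints.

    Assumption: 'lo-hi' is inclusive at both ends, as in most range syntaxes.
    """
    selected = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            raise ValueError(f"empty item in range list {spec!r}")
        if "-" in item:
            # A dashed item names a run of consecutive numbers.
            lo_text, hi_text = item.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"reversed range {item!r}")
            selected.update(range(lo, hi + 1))
        else:
            # A bare item names a single column or worksheet.
            selected.add(int(item))
    return sorted(selected)


if __name__ == "__main__":
    print(parse_range("1,7,9-14"))  # [1, 7, 9, 10, 11, 12, 13, 14]
    print(parse_range("3-5"))       # [3, 4, 5]
```

     Under these assumptions, -w 1,7,9-14 selects eight
     worksheets and -c 3-5 selects three columns.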

EXAMPLE
     Extract pieces of an MS Excel spreadsheet.

          aux/olefs report.xls
          aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
          unmount /mnt/doc

SOURCE
     /rc/bin             doc2txt, doc2ps, wdoc2txt, and xls2txt
     /sys/src/cmd/aux    the others

SEE ALSO
     strings(1)
     ``Microsoft Word 97 Binary File Format'', at Microsoft's
     developer (MSDN) home page.
     ``LAOLA Binary Structures'',
     http://user.cs.tu-berlin.de/~schwartz/pmh
     ``OpenOffice.Org's Excel Documentation'',
     http://sc.openoffice.org/excelfileformat.pdf

Plan 9 (printed 12/4/24)