Converting a Word document to text using IronPython
The module below demonstrates how to convert a batch of Word documents to text.
If calling from the command line, you can pass the path of the directory containing the files to convert as an argument. Or, call the module without an argument and it will use the directory you have defined as test_path.
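The argument handling described above comes down to a check on sys.argv. Here is a minimal sketch of that check in plain Python; pick_path is a hypothetical helper name used only for illustration and is not part of the module.

```python
import sys

def pick_path(argv, default_path):
    # argv[0] is always the script name, so exactly one user-supplied
    # argument means the list has two entries.
    if len(argv) == 2:
        return argv[1]
    return default_path

test_path = "C:\\test\\"

# With no argument this prints test_path; with one argument it prints that argument.
print(pick_path(sys.argv, test_path))
```

Note that the path is argv[1], not the argv list itself.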
The doc_to_text function does the actual work of converting an individual Word document to text. Using COM Interop, it opens the Word document, loops through the paragraphs, and collects the text of each one. Each paragraph's text is passed to clean_text for cleansing. Internally, Word documents end paragraphs with carriage returns (CR), which I replace with carriage returns plus line feeds (CR+LF). Page breaks are represented by the form feed (FF) character, which I remove. I'm not sure what the BEL character is used for; however, it was prevalent in my Word documents, so I replaced it with an empty string.
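The cleaning step can be tried on its own in plain Python, without Word installed. The sketch below uses a made-up sample string; the constant names FF, BEL and CR are my own labels for illustration.

```python
# Control characters mentioned in the text, written as hex escapes.
# Beware of octal escapes here: "\12" is LF (decimal 10), while FF is "\14".
FF = "\x0c"   # form feed, used for page breaks
BEL = "\x07"  # bell
CR = "\r"     # carriage return, ends a paragraph

def clean_text(text):
    text = text.replace(FF, "")      # drop page breaks
    text = text.replace(BEL, "")     # drop BEL characters
    text = text.replace(CR, "\r\n")  # CR -> CR+LF
    return text

# Invented sample resembling raw paragraph text.
sample = "First paragraph\r\x0cSecond paragraph\x07\rThird paragraph\r"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The CR replacement runs last, and the input is assumed to contain bare CRs only; text that already contains CR+LF would end up with a doubled line feed.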
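The convert_files function gets a list of all of the Word documents (*.doc) in a directory, loops through that list converting each file, and saves each result as a text file with the same base name. Note that the text files are created in the current working directory.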
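The batch loop itself does not depend on Word, so it can be sketched and tested in plain Python 3. This is an illustration under stated assumptions, not the module's code: the convert parameter and the fake_convert stand-in are hypothetical, and unlike the module this sketch writes each .txt file next to its source document.

```python
import os
import tempfile

def convert_files(doc_path, convert):
    """Run convert on every .doc file in doc_path and write a .txt per file.

    convert is any callable that takes a file path and returns text
    (a hypothetical stand-in for doc_to_text).
    """
    written = []
    for name in sorted(os.listdir(doc_path)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".doc":
            continue
        text = convert(os.path.join(doc_path, name))
        out_path = os.path.join(doc_path, base + ".txt")
        # newline="" writes the text as is, so existing CR+LF pairs are kept.
        with open(out_path, "w", newline="") as handle:
            handle.write(text)
        written.append(out_path)
    return written

def fake_convert(path):
    # Stand-in converter so the loop can run without Word.
    return "text of " + os.path.basename(path)

# Build a throwaway directory with two fake documents and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name in ("a.doc", "b.doc", "notes.md"):
    open(os.path.join(demo_dir, name), "w").close()

written = convert_files(demo_dir, fake_convert)
print([os.path.basename(p) for p in written])
```

Passing the converter in as a parameter keeps the directory handling testable separately from the COM Interop code.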
__author__ = "Edward J. Stembler"
__date__ = "2009-01-09"
__module_name__ = "Converts a batch of Word documents, found in a directory, to text"
__version__ = "1.0"
version_info = (1, 0, 0)

import sys
import clr
import System
from System.Text import StringBuilder
from System.IO import DirectoryInfo, File, FileInfo, Path, StreamWriter

clr.AddReference("Microsoft.Office.Interop.Word")
import Microsoft.Office.Interop.Word as Word


def convert_files(doc_path):
    # Convert every *.doc file found in doc_path to a .txt file.
    directory = DirectoryInfo(doc_path)
    files = directory.GetFiles("*.doc")
    for file_info in files:
        text = doc_to_text(Path.Combine(doc_path, file_info.Name))
        # The .txt file is created in the current working directory.
        stream_writer = File.CreateText(Path.GetFileNameWithoutExtension(file_info.Name) + ".txt")
        stream_writer.Write(text)
        stream_writer.Close()


def doc_to_text(filename):
    # Open the document in a hidden Word instance and collect its paragraph text.
    word_application = Word.ApplicationClass()
    word_application.Visible = False
    document = word_application.Documents.Open(filename)
    result = StringBuilder()
    for p in document.Paragraphs:
        result.Append(clean_text(p.Range.Text))
    document.Close()
    document = None
    word_application.Quit()
    word_application = None
    return result.ToString()


def clean_text(text):
    text = text.replace("\f", "")       # FF (note: "\12" would be LF, not FF)
    text = text.replace("\07", "")      # BEL
    text = text.replace("\r", "\r\n")   # CR -> CRLF
    return text


test_path = "C:\\test\\"

if __name__ == "__main__":
    if len(sys.argv) == 2:
        # sys.argv[0] is the script name; the path is the first real argument.
        convert_files(sys.argv[1])
    else:
        convert_files(test_path)