Convert a Word doc to HTML programmatically in Java

Tags: java, html, ms-word

I need to convert a Word document into one or more HTML files in Java. The function takes a Word document as input and produces one HTML file per page, so a 3-page document yields 3 HTML files, split at the page breaks.

I searched for open-source or non-commercial APIs that can convert doc to HTML, but found nothing. If anyone has done this kind of job before, please help.


asked Sep 14, 2015 by MinaAxfordvz
0 votes

11 Answers

0 votes

We use tm-extractors, and fall back to the commercial Aspose. Both have native Java APIs.

answered Sep 14, 2015 by ShaniceRinte
0 votes

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has plenty of documentation, scripts, and tutorials to help you out.

answered Sep 14, 2015 by amit.clavax
0 votes

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

You'd probably want to start OpenOffice once when your program starts, and then call the Python script as many times as you need while the program runs (with some sort of check to ensure the ooffice process is still alive).

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
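
As a rough illustration of spawning a conversion process from Java, here is a minimal sketch. It does not reproduce the macro approach above (that macro code is not available); instead it assumes a LibreOffice install whose `soffice` binary is on the PATH and uses its `--headless --convert-to html` options. The class name `HeadlessConvert` and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class HeadlessConvert {

    // Builds the command line. Assumption: LibreOffice-style options;
    // older OpenOffice builds may not accept --convert-to.
    static List<String> buildCommand(String officeBinary, Path input, Path outDir) {
        return Arrays.asList(
                officeBinary, "--headless",
                "--convert-to", "html",
                "--outdir", outDir.toString(),
                input.toString());
    }

    // Runs one conversion and returns the process exit code.
    // The child process is killed if it exceeds the timeout.
    static int convert(String officeBinary, Path input, Path outDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(officeBinary, input, outDir))
                .inheritIO() // forward the converter's output to our console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("conversion timed out after " + timeoutSeconds + "s");
        }
        return process.exitValue();
    }
}
```

A call such as `HeadlessConvert.convert("soffice", Paths.get("test.doc"), Paths.get("out"), 60)` would be expected to leave the HTML file in the `out` directory. Spawning a fresh process per document is simple but slow, which is why keeping one office instance running, as described above, tends to scale better.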

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.

answered Sep 14, 2015 by AntAmq
0 votes

This is easier with the newer MS Word .docx format, because it is XML-based. You can use an XSL stylesheet to transform the document's XML into HTML.
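
To make the XSL idea concrete, here is a minimal sketch using only the JDK. It assumes the main document part lives at `word/document.xml` inside the .docx ZIP package, and it uses a deliberately tiny stylesheet that turns each top-level paragraph into a `<p>` and ignores styles, tables, and images. The class name `DocxXslt` and the stylesheet are illustrative, not a complete converter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class DocxXslt {

    // Toy stylesheet: one <p> per top-level w:p, containing the text of its w:t elements.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
        + " exclude-result-prefixes='w'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'><html><body>"
        + "<xsl:apply-templates select='w:document/w:body/w:p'/>"
        + "</body></html></xsl:template>"
        + "<xsl:template match='w:p'><p><xsl:apply-templates select='.//w:t'/></p></xsl:template>"
        + "<xsl:template match='w:t'><xsl:value-of select='.'/></xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML of a main document part.
    static String toHtml(InputStream documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(documentXml), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("could not transform document XML", e);
        }
    }

    // Opens the .docx as a ZIP file and transforms its main document part.
    static String convert(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return toHtml(in);
            }
        }
    }
}
```

A real stylesheet would need many more templates (runs with bold or italic, tables, hyperlinks, images), which is the work that libraries mentioned elsewhere in this thread do for you.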

If your Word document is in the older binary format, however, you can use the POI library to read it into Java objects, and from there convert it to HTML using a Java HTML library.

answered Sep 14, 2015 by AugustYJBSbn
0 votes

Here are some starting points for you. Good luck.

On Microsoft's website, you can find documentation for the .doc format, and on the ECMA website, the .docx format. Microsoft has a category for Java on their OpenXML developer blog, including a post specifically about converting OpenXML to XHTML in Java.

answered Sep 14, 2015 by CassandraLam
0 votes

If it's a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
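
One way the lastRenderedPageBreak idea could be sketched with the JDK's DOM parser is shown below. This is not docx4j code. It assumes you already have the contents of `word/document.xml` as a string, and it makes a simplifying assumption: a paragraph that contains the marker is treated as the first paragraph of a new page, even though Word can place the marker in the middle of a paragraph. The class name `PageSplitter` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class PageSplitter {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one list of paragraph texts per page, using the
    // w:lastRenderedPageBreak markers Word saved in the document.
    static List<List<String>> splitIntoPages(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the w: namespace lookups below
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }

        List<List<String>> pages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            boolean startsNewPage =
                    paragraph.getElementsByTagNameNS(W, "lastRenderedPageBreak").getLength() > 0;
            if (startsNewPage && !current.isEmpty()) {
                pages.add(current);
                current = new ArrayList<>();
            }
            current.add(textOf(paragraph));
        }
        if (!current.isEmpty()) {
            pages.add(current);
        }
        return pages;
    }

    // Concatenates the text of all w:t elements inside a paragraph.
    private static String textOf(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList parts = paragraph.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < parts.getLength(); i++) {
            text.append(parts.item(i).getTextContent());
        }
        return text.toString();
    }
}
```

Each inner list could then be written to its own HTML file. The markers only reflect the pagination at the time Word last saved the file, so documents produced by other tools may contain none at all.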

answered Sep 14, 2015 by Horace08Byik
0 votes

I see this thread turns up in external links and gets the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java both run on many platforms, so Java systems can use the OpenOffice UNO API directly to import, manipulate, and export documents in many formats (including Word and PDF), or use a library like JODReports or Docmosis to make that easier. Both have free/open options.

answered Sep 14, 2015 by SteAQYY
0 votes

If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing an OOXML library for Java.

If you are targeting the binary files, though, that's another problem.

answered Sep 14, 2015 by ErmelindPink
0 votes

You'd have to find the MS Word .doc specification (the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work, and I'd advise against doing it yourself (converting file formats, or even reading commercial file formats on your own, is always hard and often incomplete). PS: just google doc2html.

answered Sep 14, 2015 by KassandraPet
0 votes

    import officetools.OfficeFile; // package available at
    FileInputStream fis = new FileInputStream(new File("test.doc"));
    FileOutputStream fos = new FileOutputStream(new File("test.html"));
    OfficeFile f = new OfficeFile(fis,"localhost","8100", true);

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

answered Sep 14, 2015 by MichelleHack
0 votes

I tried the approach from this site and it worked for me.

It only works with docx. It converts the document to HTML, including the images embedded in the Word document.

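    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    // The converter classes come from xdocreport; their package names differ between releases.
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.core.FileURIResolver;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;

    File source = new File("c:/document.docx");
    try (InputStream in = new FileInputStream(source);
         OutputStream out = new FileOutputStream(new File("c:/document.html"))) {
        // 1) Load DOCX into XWPFDocument
        XWPFDocument document = new XWPFDocument(in);

        // 2) Prepare XHTML options
        XHTMLOptions options = XHTMLOptions.create();

        // 3) Extract images into target/images/<source file name>
        //    (the original concatenated the InputStream itself into the path)
        File imageFolder = new File("target/images/" + source.getName());
        options.setExtractor(new FileImageExtractor(imageFolder));

        // 4) URI resolver: point image links in the HTML at the same folder
        options.URIResolver(new FileURIResolver(imageFolder));

        // 5) Convert; both streams are closed by try-with-resources
        XHTMLConverter.getInstance().convert(document, out, options);
    }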

I hope this solves your issue.

answered Sep 14, 2015 by IndFadden