Friday, February 22, 2019

Apache POI Word - Text Extraction

This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.
For .docx files, we use the class org.apache.poi.xwpf.extractor.XPFFWordExtractor that extracts and returns simple data from a Word file. In the same way, we have different methodologies to extract headings, footnotes, table data, etc. from a Word file.
The following code shows how to extract simple text from a Word file −
import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {

   public static void main(String[] args)throws Exception {

      XWPFDocument docx = new XWPFDocument(new FileInputStream("create_paragraph.docx"));
      
      //using XWPFWordExtractor Class
      XWPFWordExtractor we = new XWPFWordExtractor(docx);
      System.out.println(we.getText());
   }
}
Save the above code as WordExtractor.java. Compile and execute it from the command prompt as follows −
$javac WordExtractor.java
$java WordExtractor
It will generate the following output:
At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning
purpose in the domains of Academics, Information Technology, Management and Computer
Programming Languages.

No comments:

Post a Comment

Popular Posts