Remove HTML tags from a file to extract only the TEXT

1. Using regular expression

A regular expression is used to strip out anything between a < and a >.

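import java.io.BufferedReader;
import java.io.FileReader;

public class Html2TextWithRegExp {
   private Html2TextWithRegExp() {}

   public static void main (String[] args) throws Exception{
     StringBuilder sb = new StringBuilder();
     // try-with-resources closes the reader even if reading fails
     try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
       String line;
       while ( (line=br.readLine()) != null) {
         // keep the line breaks so words on adjacent lines are not glued together
         sb.append(line).append(System.getProperty("line.separator"));
       }
     }
     // (?s) lets "." match line breaks, so a tag split over several lines is removed too
     String nohtml = sb.toString().replaceAll("(?s)<.*?>","");
     System.out.println(nohtml);
   }
}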

However, if the page contains JavaScript, the script source is treated as text and ends up in the output. You may also need extra logic while reading to keep only what is inside the <BODY> tag.
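One way to handle both points is sketched below. It is only an illustration: the class name Html2TextBodyOnly and the method toText are made up for this example. It assumes the page has at most one <BODY> element and that a script never contains a literal closing script tag inside a string. The idea is to keep only the body, drop script and style blocks and comments first, then remove the remaining tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original listing.
public class Html2TextBodyOnly {
  // (?i) = case-insensitive, (?s) = "." also matches line breaks
  private static final Pattern BODY =
      Pattern.compile("(?is)<body\\b[^>]*>(.*?)</body\\s*>");
  private static final Pattern SCRIPT_OR_STYLE =
      Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
  private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
  private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

  private Html2TextBodyOnly() {}

  public static String toText(String html) {
    // keep only the body when there is one, otherwise work on the whole input
    Matcher body = BODY.matcher(html);
    String text = body.find() ? body.group(1) : html;
    // remove whole script/style blocks before the generic tag removal,
    // otherwise their content would survive as text
    text = SCRIPT_OR_STYLE.matcher(text).replaceAll(" ");
    text = COMMENT.matcher(text).replaceAll(" ");
    text = TAG.matcher(text).replaceAll(" ");
    // collapse the whitespace left behind by the removed markup
    return text.replaceAll("\\s+", " ").trim();
  }

  public static void main(String[] args) {
    String html = "<html><head><title>Demo</title></head>"
        + "<body><script>var x = 1 < 2;</script>"
        + "<p>Hello <b>world</b></p></body></html>";
    System.out.println(toText(html));   // prints: Hello world
  }
}
```

Entities such as &amp; are left untouched by this sketch, and malformed markup can still fool the patterns, which is one reason a real parser is usually the safer choice.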

2. Using javax.swing.text.html.HTMLEditorKit

In most cases, the HTMLEditorKit is used with a JEditorPane text component, but it can also be used directly to extract text from an HTML page.

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractText(Reader reader) throws IOException {
    final ArrayList<String> list = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
      public void handleText(final char[] data, final int pos) {
        list.add(new String(data));
      }
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
      public void handleEndTag(Tag t, final int pos) {  }
      public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
      public void handleComment(final char[] data, final int pos) { }
      public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    parserDelegator.parse(reader, parserCallback, true);
    return list;
  }

  public final static void main(String[] args) throws Exception{
    // try-with-resources closes the file once parsing is done
    try (FileReader reader = new FileReader("java-new.html")) {
      List<String> lines = HTMLUtils.extractText(reader);
      for (String line : lines) {
        System.out.println(line);
      }
    }
  }
}

Note that the HTMLEditorKit can be easily confused if the HTML to be parsed is not well-formed.
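A shorter variant of the same idea is sketched here. ParserCallback already provides empty implementations of all its handler methods, so only handleText needs to be overridden. The class name Html2TextWithSwingParser and the method toText are invented for this example, and joining the fragments with a single space is just one possible choice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original listing.
public class Html2TextWithSwingParser {
  private Html2TextWithSwingParser() {}

  public static String toText(Reader reader) throws IOException {
    final StringBuilder sb = new StringBuilder();
    ParserCallback callback = new ParserCallback() {
      @Override
      public void handleText(char[] data, int pos) {
        // separate the text fragments with one space
        if (sb.length() > 0) sb.append(' ');
        sb.append(data);
      }
    };
    // third argument: ignore any charset declared in the document
    new ParserDelegator().parse(reader, callback, true);
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    String html =
        "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
    System.out.println(toText(new StringReader(html)));
  }
}
```

Reading from a StringReader makes it easy to try the parser on small snippets before pointing it at a real file.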

3. Using an HTML parser

This is probably the best solution (provided the chosen parser is good!).

There are many parsers available on the net. In this HowTo, I will use the open source package Jsoup.

Jsoup is entirely self-contained and has no dependencies, which is a good thing.

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.io.BufferedReader;
import org.jsoup.Jsoup;

public class HTMLUtils {
  private HTMLUtils() {}

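  public static String extractText(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(reader);
    String line;
    while ( (line=br.readLine()) != null) {
      // keep the line break so words on adjacent lines are not glued together;
      // text() collapses runs of whitespace anyway
      sb.append(line).append('\n');
    }
    // text() returns the text of all elements, without the markup
    String textOnly = Jsoup.parse(sb.toString()).text();
    return textOnly;
  }

  public final static void main(String[] args) throws Exception{
    // try-with-resources closes the file once the text is extracted
    try (FileReader reader = new FileReader
          ("C:/RealHowTo/topics/java-language.html")) {
      System.out.println(HTMLUtils.extractText(reader));
    }
  }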
}

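4. Using Apache Tika

Apache Tika detects the type of a document and extracts its text. Here the AutoDetectParser parses the file and the BodyContentHandler collects the text found in the body.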

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ParseHTMLWithTika {
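  public static void main(String args[]) throws Exception {
    // try-with-resources closes the stream even if parsing fails
    try (InputStream is = new FileInputStream("C:/Temp/java-x.html")) {
         // collects the text found inside the body of the document;
         // by default it stops with an exception after 100,000 characters,
         // use new BodyContentHandler(-1) to remove the limit
         ContentHandler contenthandler = new BodyContentHandler();
         Metadata metadata = new Metadata();
         // detects the document type and delegates to the matching parser
         Parser parser = new AutoDetectParser();
         parser.parse(is, contenthandler, metadata, new ParseContext());
         System.out.println(contenthandler.toString());
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }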
}

Done! Happy Coding!
