1. Using javax.swing.text.html.HTMLEditorKit
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLUtils {
    private HTMLUtils() {}

    public static List<String> extractLinks(Reader reader) throws IOException {
        final List<String> list = new ArrayList<String>();
        ParserDelegator parserDelegator = new ParserDelegator();
        // ParserCallback already provides empty handlers, so only
        // handleStartTag needs to be overridden.
        ParserCallback parserCallback = new ParserCallback() {
            @Override
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
                if (tag == Tag.A) {
                    // <a> tags without an href (named anchors) return null here
                    Object address = attribute.getAttribute(Attribute.HREF);
                    if (address != null) {
                        list.add(address.toString());
                    }
                }
            }
        };
        // true = ignore a charset declared in the document instead of
        // throwing ChangedCharSetException
        parserDelegator.parse(reader, parserCallback, true);
        return list;
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the reader once parsing is done
        try (FileReader reader = new FileReader("java-new.html")) {
            List<String> links = HTMLUtils.extractLinks(reader);
            for (String link : links) {
                System.out.println(link);
            }
        }
    }
}
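The listing above returns each href exactly as it is written in the page, so relative links stay relative. If absolute URLs are needed, one possible approach is to resolve every value against the address of the page with java.net.URI. The following is only a minimal sketch: the class name LinkResolver, the method resolveAll, and the example.com base URL are hypothetical and not part of the original HowTo.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (assumption): not part of the original HowTo.
final class LinkResolver {
    private LinkResolver() {}

    // Resolves each raw href against baseUrl.
    // Null, blank, and syntactically invalid values are skipped.
    public static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            try {
                // URI.resolve(String) parses the value and resolves it against the base
                result.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // the value is not a valid URI (for example, it contains a space); skip it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The base URL is a placeholder chosen for this sketch.
        List<String> raw = Arrays.asList("java-0001.html", "/index.html", "http://example.org/x");
        for (String link : resolveAll("http://www.example.com/topics/java-language.html", raw)) {
            System.out.println(link);
        }
        // prints:
        // http://www.example.com/topics/java-0001.html
        // http://www.example.com/index.html
        // http://example.org/x
    }
}
```

The list returned by extractLinks(reader) can be passed in directly as the second argument. Silently skipping invalid values is a design choice of this sketch; a stricter version could collect or log them instead.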
2. Using an HTML parser
This example uses jsoup, an open-source HTML parser library.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
    private HTMLUtils() {}

    public static List<String> extractLinks(String url) throws IOException {
        final List<String> result = new ArrayList<String>();

        // fetch the page and parse it
        Document doc = Jsoup.connect(url).get();

        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        // hyperlinks (<a href>); the "abs:" prefix returns the absolute URL
        for (Element link : links) {
            result.add(link.attr("abs:href"));
        }

        // any element with a src attribute (img, script, ...)
        for (Element src : media) {
            result.add(src.attr("abs:src"));
        }

        // <link href> elements (stylesheets, icons, ...)
        for (Element link : imports) {
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String site = "http://www.rgagnon.com/topics/java-language.html";
        List<String> links = HTMLUtils.extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    }
}
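With jsoup, the abs: prefix yields an empty string when a value cannot be turned into an absolute URL, and a real page often links to the same address more than once. If that matters, the result can be cleaned up before it is used. The following is a minimal sketch using only the standard library; the class name LinkCleaner and the method clean are hypothetical and not part of the original HowTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (assumption): not part of the original HowTo.
final class LinkCleaner {
    private LinkCleaner() {}

    // Drops null and empty entries and removes duplicates,
    // keeping the first occurrence of each link in its original position.
    public static List<String> clean(List<String> links) {
        Set<String> unique = new LinkedHashSet<>();
        for (String link : links) {
            if (link != null && !link.isEmpty()) {
                unique.add(link);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch.
        List<String> raw = Arrays.asList(
                "http://example.com/a", "", "http://example.com/b", "http://example.com/a");
        System.out.println(clean(raw));
        // prints: [http://example.com/a, http://example.com/b]
    }
}
```

A LinkedHashSet is used here instead of a HashSet so that the links stay in document order.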
Done! Happy Coding!