1. Using javax.swing.text.html.HTMLEditorKit
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLUtils {
    private HTMLUtils() {}

    // Collects the href value of every <a> tag read from the given reader.
    public static List<String> extractLinks(Reader reader) throws IOException {
        final List<String> list = new ArrayList<String>();
        ParserDelegator parserDelegator = new ParserDelegator();
        // ParserCallback already supplies empty handlers, so only the
        // start-tag handler needs to be overridden.
        ParserCallback parserCallback = new ParserCallback() {
            @Override
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
                if (tag == Tag.A) {
                    // Anchors such as <a name="top"> have no href; skip them.
                    Object address = attribute.getAttribute(Attribute.HREF);
                    if (address != null) {
                        list.add(address.toString());
                    }
                }
            }
        };
        // true = ignore charset declarations inside the page. The Reader has
        // already decoded the characters, and passing false makes the parser
        // throw ChangedCharSetException when it meets a <meta> charset tag.
        parserDelegator.parse(reader, parserCallback, true);
        return list;
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the file once parsing is finished.
        try (Reader reader = new FileReader("java-new.html")) {
            List<String> links = HTMLUtils.extractLinks(reader);
            for (String link : links) {
                System.out.println(link);
            }
        }
    }
}
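Note that extractLinks returns each href exactly as it is written in the page, so relative links stay relative. Here is a minimal sketch of one way to make them absolute, assuming you know the URL the page was loaded from. The class name LinkResolverSketch, the method extractAbsoluteLinks and the example.com addresses are made up for this illustration; the resolving itself is done by java.net.URI.resolve from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original HTMLUtils class.
public class LinkResolverSketch {
    private LinkResolverSketch() {}

    // Returns the href of every <a> tag, resolved against baseUrl.
    public static List<String> extractAbsoluteLinks(Reader reader, String baseUrl)
            throws IOException {
        final URI base = URI.create(baseUrl);
        final List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // anchor without an href, e.g. <a name="top">
                }
                try {
                    // resolve() leaves absolute URLs alone and completes relative ones.
                    result.add(base.resolve(href.toString().trim()).toString());
                } catch (IllegalArgumentException e) {
                    // The href is not a valid URI (for example it contains spaces); skip it.
                }
            }
        };
        // true = ignore charset declarations, the Reader has already decoded the text.
        new ParserDelegator().parse(reader, callback, true);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<a href=\"/docs/index.html\">Docs</a>"
                + "<a href=\"page2.html\">Next</a>"
                + "<a name=\"top\">No href</a>"
                + "</body></html>";
        for (String link : extractAbsoluteLinks(new StringReader(html),
                "http://example.com/topics/java.html")) {
            System.out.println(link);
        }
    }
}
```

Links that cannot be parsed as a URI are silently dropped here; depending on your needs you may prefer to keep the raw value instead.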
2. Using an HTML parser
In this how-to, I will use the open-source library jsoup.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
    private HTMLUtils() {}

    // Downloads the page at the given URL and returns the absolute URLs it references.
    public static List<String> extractLinks(String url) throws IOException {
        final List<String> result = new ArrayList<String>();
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");
        // hyperlinks (<a href="...">)
        for (Element link : links) {
            // the "abs:" prefix resolves the value against the page URL
            result.add(link.attr("abs:href"));
        }
        // any element with a src attribute (img, script, iframe, ...)
        for (Element src : media) {
            result.add(src.attr("abs:src"));
        }
        // <link> elements (stylesheets, icons, ...)
        for (Element link : imports) {
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String site = "http://www.rgagnon.com/topics/java-language.html";
        List<String> links = HTMLUtils.extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    }
}
Done! Happy Coding!