1. Using javax.swing.text.html.HTMLEditorKit
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractLinks(Reader reader) throws IOException {
    final List<String> list = new ArrayList<String>();
    ParserDelegator parserDelegator = new ParserDelegator();
    // ParserCallback already provides empty implementations of every handler,
    // so only the start-tag handler needs to be overridden.
    ParserCallback parserCallback = new ParserCallback() {
      @Override
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
        if (tag == Tag.A) {
          String address = (String) attribute.getAttribute(Attribute.HREF);
          // An anchor such as <a name="top"> has no href; skip it.
          if (address != null) {
            list.add(address);
          }
        }
      }
    };
    parserDelegator.parse(reader, parserCallback, false);
    return list;
  }

  public static void main(String[] args) throws Exception {
    // try-with-resources closes the file even if parsing fails
    try (FileReader reader = new FileReader("java-new.html")) {
      List<String> links = HTMLUtils.extractLinks(reader);
      for (String link : links) {
        System.out.println(link);
      }
    }
  }
}
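To try this approach without a file on disk, the same callback can be fed from a StringReader. The following is only a sketch: the class name LinkSketch and the method extractLinksFromString are names I made up for illustration. It passes true as the last argument of parse() (ignoreCharSet), which is my assumption of a convenient default, since with false the parser may throw a ChangedCharSetException when the page declares its charset in a meta tag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original HowTo.
final class LinkSketch {
    private LinkSketch() {}

    // Collects the href value of every <a> tag found in an HTML string.
    static List<String> extractLinksFromString(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // skip anchors without an href
                        links.add(href.toString());
                    }
                }
            }
        };
        try (StringReader reader = new StringReader(html)) {
            // true = ignore charset declarations inside the document (assumption, see above)
            new ParserDelegator().parse(reader, callback, true);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"a.html\">A</a>"
                + "<a name=\"top\">no href</a>"
                + "<a href=\"http://example.com/b\">B</a>"
                + "</body></html>";
        for (String link : extractLinksFromString(html)) {
            System.out.println(link);
        }
    }
}
```

Note that the href values come back exactly as written in the page, so relative links stay relative.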
2. Using an HTML parser
In this HowTo, I will use the open-source library jsoup.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractLinks(String url) throws IOException {
    final List<String> result = new ArrayList<String>();

    // Fetch and parse the page
    Document doc = Jsoup.connect(url).get();

    Elements links = doc.select("a[href]");
    Elements media = doc.select("[src]");
    Elements imports = doc.select("link[href]");

    // hyperlinks (<a href=...>); "abs:" returns the absolute URL
    for (Element link : links) {
      result.add(link.attr("abs:href"));
    }

    // anything with a src attribute (img, script, ...)
    for (Element src : media) {
      result.add(src.attr("abs:src"));
    }

    // <link href=...> elements (css, icons, ...)
    for (Element link : imports) {
      result.add(link.attr("abs:href"));
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    String site = "http://www.rgagnon.com/topics/java-language.html";
    List<String> links = HTMLUtils.extractLinks(site);
    for (String link : links) {
      System.out.println(link);
    }
  }
}
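The abs: prefix in the jsoup example asks for the attribute value resolved against the page's base URI. The HTMLEditorKit version has nothing like it and returns the href as written. A rough equivalent can be sketched with java.net.URI.resolve from the JDK. The class AbsoluteLinks and the method toAbsolute are hypothetical names of mine, and silently dropping malformed values is a simplification I chose, not something jsoup necessarily does the same way.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original HowTo.
final class AbsoluteLinks {
    private AbsoluteLinks() {}

    // Resolves each href against baseUrl; values that are not valid URIs are skipped.
    static List<String> toAbsolute(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            try {
                result.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // e.g. an href containing an unescaped space; ignored in this sketch
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = new ArrayList<>();
        hrefs.add("java-0001.html");
        hrefs.add("../index.html");
        hrefs.add("http://example.com/x");
        String base = "http://www.rgagnon.com/topics/java-language.html";
        for (String url : toAbsolute(base, hrefs)) {
            System.out.println(url);
        }
    }
}
```

Already absolute links pass through unchanged, so the list returned by the first example can be handed to toAbsolute as-is.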
Done! Happy Coding!