1. Using javax.swing.text.html.HTMLEditorKit
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLUtils {
    private HTMLUtils() {}

    public static List<String> extractLinks(Reader reader) throws IOException {
        final List<String> list = new ArrayList<String>();

        ParserDelegator parserDelegator = new ParserDelegator();
        // ParserCallback already provides empty handlers, so only
        // handleStartTag needs to be overridden.
        ParserCallback parserCallback = new ParserCallback() {
            @Override
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
                if (tag == Tag.A) {
                    // An <a> element without an href (a named anchor) yields null.
                    Object address = attribute.getAttribute(Attribute.HREF);
                    if (address != null) {
                        list.add(address.toString());
                    }
                }
            }
        };
        // The third argument is ignoreCharSet.
        parserDelegator.parse(reader, parserCallback, false);
        return list;
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the file once parsing is done.
        try (FileReader reader = new FileReader("java-new.html")) {
            List<String> links = HTMLUtils.extractLinks(reader);
            for (String link : links) {
                System.out.println(link);
            }
        }
    }
}
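The extractLinks method above returns each href exactly as it is written in the document, so relative links such as page2.html stay relative. The following is a minimal sketch of one way to turn them into absolute URLs with java.net.URI. The class name LinkResolver, the method resolveAll, and the example.com base address are made up for this illustration and are not part of the original HowTo.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkResolver {
    private LinkResolver() {}

    // Hypothetical helper: resolves each raw href against baseUri.
    // Null or blank entries and values that are not valid URIs are skipped.
    public static List<String> resolveAll(String baseUri, List<String> hrefs) {
        URI base = URI.create(baseUri);
        List<String> result = new ArrayList<String>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            try {
                result.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // URI.resolve(String) rejects syntactically invalid values; ignore them here.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed base address of the page the links were extracted from.
        String base = "http://example.com/docs/index.html";
        List<String> raw = Arrays.asList("page2.html", "../a.html", "https://other.example/x");
        for (String link : resolveAll(base, raw)) {
            System.out.println(link);
        }
    }
}
```

With the sample input, page2.html resolves to http://example.com/docs/page2.html and ../a.html to http://example.com/a.html, while the link that is already absolute is left unchanged.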
2. Using an HTML parser
This example uses the open-source jsoup library, which downloads the page and lets you pick elements with CSS-style selectors.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
    private HTMLUtils() {}

    public static List<String> extractLinks(String url) throws IOException {
        final List<String> result = new ArrayList<String>();

        // Fetch and parse the page.
        Document doc = Jsoup.connect(url).get();

        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        // Hyperlinks; the "abs:" prefix resolves the value against the page URL.
        for (Element link : links) {
            result.add(link.attr("abs:href"));
        }

        // Any element with a src attribute (img, script, ...).
        for (Element src : media) {
            result.add(src.attr("abs:src"));
        }

        // link elements (css, ...).
        for (Element link : imports) {
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String site = "http://www.rgagnon.com/topics/java-language.html";
        List<String> links = HTMLUtils.extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    }
}
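The list built above can contain the same address several times, and jsoup returns an empty string for an abs: attribute that it cannot turn into an absolute URL. The following is a small sketch, using only the JDK, of one way to drop empty entries and duplicates while keeping the original order. The class LinkCleaner and its clean method are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkCleaner {
    private LinkCleaner() {}

    // Hypothetical helper: removes null/blank entries and duplicates.
    // LinkedHashSet keeps the first occurrence of each link in its original position.
    public static List<String> clean(List<String> links) {
        Set<String> seen = new LinkedHashSet<String>();
        for (String link : links) {
            if (link != null && !link.trim().isEmpty()) {
                seen.add(link.trim());
            }
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        // Sample data standing in for the result of extractLinks.
        List<String> raw = Arrays.asList(
                "http://example.com/a", "", "http://example.com/b", "http://example.com/a");
        System.out.println(clean(raw)); // prints [http://example.com/a, http://example.com/b]
    }
}
```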
Done! Happy Coding!