1. Using javax.swing.text.html.HTMLEditorKit
import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractLinks(Reader reader) throws IOException {
    final ArrayList<String> list = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
      public void handleText(final char[] data, final int pos) { }

      // Called for every opening tag; only <a> elements are of interest here.
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
        if (tag == Tag.A) {
          String address = (String) attribute.getAttribute(Attribute.HREF);
          // Anchors such as <a name="top"> have no href; skip them instead of adding null.
          if (address != null) {
            list.add(address);
          }
        }
      }

      public void handleEndTag(Tag t, final int pos) { }
      public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
      public void handleComment(final char[] data, final int pos) { }
      public void handleError(final java.lang.String errMsg, final int pos) { }
    };

    // Third argument is ignoreCharSet.
    parserDelegator.parse(reader, parserCallback, false);
    return list;
  }

  public final static void main(String[] args) throws Exception {
    // try-with-resources closes the file even if parsing fails.
    try (FileReader reader = new FileReader("java-new.html")) {
      List<String> links = HTMLUtils.extractLinks(reader);
      for (String link : links) {
        System.out.println(link);
      }
    }
  }
}
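The same callback approach can be extended to the other kinds of references a page contains. The following is only a minimal sketch, not part of the original how-to: the class name `SwingLinkSketch` and the method `extractAll` are hypothetical. It assumes that `<img>` and `<link>` are reported through `handleSimpleTag` (they have no closing tag), reads the HTML from a `StringReader` instead of a file, and passes `true` for `ignoreCharSet` so that a `<meta>` charset declaration in the page does not interrupt parsing with a `ChangedCharSetException`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; the names are not from the original how-to.
final class SwingLinkSketch {
    private SwingLinkSketch() {}

    // Collects href values of <a> and <link> plus src values of <img>.
    static List<String> extractAll(Reader reader) throws IOException {
        final List<String> result = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    addIfPresent(attrs.getAttribute(HTML.Attribute.HREF));
                }
            }

            // Elements without a closing tag (img, link, ...) arrive here.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    addIfPresent(attrs.getAttribute(HTML.Attribute.SRC));
                } else if (tag == HTML.Tag.LINK) {
                    addIfPresent(attrs.getAttribute(HTML.Attribute.HREF));
                }
            }

            // Skips elements where the attribute is missing.
            private void addIfPresent(Object value) {
                if (value != null) {
                    result.add(value.toString());
                }
            }
        };

        // true = ignore any charset declared inside the document.
        new ParserDelegator().parse(reader, callback, true);
        return result;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><link rel=\"stylesheet\" href=\"style.css\"></head>"
                + "<body><a href=\"https://example.com/a\">A</a>"
                + "<a name=\"top\">no href</a><img src=\"logo.png\"></body></html>";
        for (String link : extractAll(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

The values come back exactly as written in the markup, so relative references such as `style.css` stay relative.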
2. Using an HTML parser
In this how-to, I will use the open-source library jsoup.
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractLinks(String url) throws IOException {
    final ArrayList<String> result = new ArrayList<String>();

    // Fetch the page and parse it into a DOM.
    Document doc = Jsoup.connect(url).get();

    // CSS-style selectors: anchors with an href, anything with a src, <link> with an href.
    Elements links = doc.select("a[href]");
    Elements media = doc.select("[src]");
    Elements imports = doc.select("link[href]");

    // href ...
    for (Element link : links) {
      // The "abs:" prefix returns the URL resolved against the page's base URI.
      result.add(link.attr("abs:href"));
    }

    // img ...
    for (Element src : media) {
      result.add(src.attr("abs:src"));
    }

    // js, css, ...
    for (Element link : imports) {
      result.add(link.attr("abs:href"));
    }
    return result;
  }

  public final static void main(String[] args) throws Exception {
    String site = "http://www.rgagnon.com/topics/java-language.html";
    List<String> links = HTMLUtils.extractLinks(site);
    for (String link : links) {
      System.out.println(link);
    }
  }
}
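The `abs:` prefix in the jsoup version returns each URL resolved against the page's base URI, whereas the Swing parser from the first section hands back attribute values as they appear in the markup. A comparable resolution step can be sketched with the JDK's `java.net.URI`. This is an illustration under assumptions: the class `AbsoluteLinkSketch`, the method `toAbsolute`, and the sample paths are all hypothetical, and `URI.resolve` is not guaranteed to match jsoup's behavior for every malformed input.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; the names are not from the original how-to.
final class AbsoluteLinkSketch {
    private AbsoluteLinkSketch() {}

    // Resolves each raw link against baseUrl; null, blank and invalid entries are skipped.
    static List<String> toAbsolute(String baseUrl, List<String> rawLinks) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String raw : rawLinks) {
            if (raw == null || raw.trim().isEmpty()) {
                continue;
            }
            try {
                result.add(base.resolve(raw.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI reference (for example, it contains spaces).
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample relative paths are made up for the demonstration.
        List<String> raw = Arrays.asList(
                "java-io.html", "../howto.html", "/images/logo.png", "https://example.com/x");
        String base = "http://www.rgagnon.com/topics/java-language.html";
        for (String link : toAbsolute(base, raw)) {
            System.out.println(link);
        }
    }
}
```

Feeding the output of the Swing-based extractor through a step like this gives results in the same absolute form as the jsoup version, without adding a dependency.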
Done! Happy Coding!