📜  jsoup - Javascript (1)

📅  Last modified: 2023-12-03 15:17:05.066000             🧑  Author: Mango

Jsoup is a Java library for working with HTML documents. It provides an API for fetching, parsing, extracting, and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. Because it is a plain library with no dependencies of its own, it can be used alongside frameworks like Spring, Hibernate, and Struts, and it runs on both the JVM and Android.

Features

Some of the main features of Jsoup are:

  • Parse HTML from a URL, file, or string
  • Extract data from HTML using CSS selectors
  • Manipulate the HTML DOM programmatically
  • Clean and sanitize HTML input to avoid XSS attacks
  • Support for non-English languages and character encodings
  • Set request cookies, headers, and other HTTP options through its own Connection API

Installation

To use Jsoup in your Java project, you can add the following Maven dependency:

<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
  </dependency>
</dependencies>

Alternatively, you can download the JAR file from the official website and add it to your project's classpath.

Examples
Parsing HTML
// Fetch the page over HTTP and parse it; get() throws IOException on failure
Document doc = Jsoup.connect("https://www.example.com").get();
System.out.println(doc.title());

This example downloads the HTML document from https://www.example.com and prints its title to the console.
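
The feature list also mentions parsing from a string, which needs no network access. The following is a minimal sketch of that case; the class name ParseFromString, the helper titleOf, and the sample markup are made up for illustration. A similar Jsoup.parse overload accepts a File and a charset name for parsing from disk.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, not part of jsoup.
class ParseFromString {

    // Parses an in-memory HTML string and returns the text of its <title>.
    // Document.title() returns an empty string when there is no <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, jsoup.</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Parsing from a string is also convenient in unit tests, since the input is fixed and no HTTP request is made.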

Extracting Data
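Document doc = Jsoup.connect("https://www.example.com").get();
// Select every <a> element that has an href attribute
Elements links = doc.select("a[href]");
for (Element link : links) {
    // attr("href") returns the value as written in the page, which may be relative
    System.out.println(link.attr("href"));
}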

This example selects every link in the HTML document (each a element with an href attribute) and prints its href value to the console. Relative URLs are printed as they appear in the page.
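
When the page uses relative links, the raw attribute value is often not enough. One possible approach, sketched here under the assumption that the base URL of the page is known, is to pass that base URI to Jsoup.parse and read each link with absUrl. The class name AbsoluteLinks, the helper absoluteLinks, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of jsoup.
class AbsoluteLinks {

    // Returns the href of every link in the HTML, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        // The second argument tells jsoup how to resolve relative URLs.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch: one relative and one absolute link.
        String html = "<a href=\"/about\">About</a>"
                + "<a href=\"https://example.org/x\">External</a>";
        for (String url : absoluteLinks(html, "https://www.example.com/")) {
            System.out.println(url);
        }
    }
}
```

A document fetched with Jsoup.connect(...).get() already carries its base URI, so absUrl can be used there without passing one explicitly.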

Manipulating the DOM
Document doc = Jsoup.connect("https://www.example.com").get();
// selectFirst returns null when the page contains no <a> element
Element link = doc.selectFirst("a");
if (link != null) {
    link.attr("href", "https://www.google.com");
    System.out.println(link);
}

This example changes the href of the first link in the parsed document to https://www.google.com and prints the modified element to the console. The change affects only the in-memory Document, not the remote page.
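
The same kind of edit can be tried offline on a parsed string. The sketch below is one way to wrap it in a reusable method; the class name RewriteFirstLink, the method rewrite, and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class, not part of jsoup.
class RewriteFirstLink {

    // Parses the HTML, points the first <a> element (if any) at newHref,
    // and returns the modified in-memory document.
    static Document rewrite(String html, String newHref) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a");
        if (link != null) {
            // attr(name, value) sets the attribute, replacing any old value.
            link.attr("href", newHref);
        }
        return doc;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        Document doc = rewrite("<p><a href=\"/old\">Old link</a></p>",
                "https://www.example.org/");
        System.out.println(doc.body().html());
    }
}
```

Returning the Document rather than a string leaves the caller free to make further changes before serializing it.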

Cleaning HTML Input
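String dirtyHtml = "<p><script>alert('XSS')</script>Example</p>";
// Safelist replaces the Whitelist class, which is deprecated in recent jsoup releases
String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());
System.out.println(cleanHtml);

This example cleans the input HTML string by keeping only the tags and attributes that Safelist.basic() permits (simple text formatting such as p, b, i, and a). The script element and its contents are dropped, and the sanitized HTML is printed to the console.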
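
Sanitizing and validating are related but separate steps. As a rough sketch, assuming a jsoup version that ships the Safelist class, the helper below cleans untrusted input, and a second helper uses Jsoup.isValid to check whether the input already conforms to the safelist. The class name CommentSanitizer and both method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical example class, not part of jsoup.
class CommentSanitizer {

    // Returns the input with everything outside the basic safelist removed.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Returns true only if the input contains nothing the safelist would remove.
    static boolean isAlreadyClean(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p><script>alert('XSS')</script>Example</p>";
        System.out.println(isAlreadyClean(dirty));
        System.out.println(sanitize(dirty));
    }
}
```

Validation is useful when rejected input should be reported to the user instead of being silently rewritten.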

Conclusion

Jsoup covers the common tasks of working with HTML in Java: parsing, querying with CSS selectors, modifying the DOM, and sanitizing untrusted input. Whether you're building web scrapers, data analysis tools, or full web applications, it handles these tasks with a small, consistent API.