📜  从 HTML 文档中提取内容的Java程序

📅  最后修改于: 2022-05-13 01:54:43.301000             🧑  作者: Mango

从 HTML 文档中提取内容的Java程序

HTML 是网络的核心,你在互联网上看到的所有页面都是 HTML,无论它们是由 JavaScript、JSP、 PHP、ASP 或任何其他网络技术动态生成的。您的浏览器实际上会解析 HTML 并为您呈现,但是如果我们需要解析 HTML 文档并查找一些元素、标签、属性或检查特定元素是否存在。在Java中,我们可以提取 HTML 内容并解析 HTML 文档。

方法:

  1. 使用文件阅读
  2. 使用Url.openStream()

方法 1:名为 FileReader 的库提供了读取任何文件的方法,而不管任何扩展名。将 HTML 行附加到 String Builder 的方法如下:

  1. 使用 FileReader 从源文件夹中读取文件并进一步
  2. 将每一行附加到字符串构建器。
  3. 当 HTML 文档中没有任何内容时,请使用函数br.close()关闭打开的文件
  4. 打印出字符串。

执行:

Java
// Java Program to Extract Content from a HTML document
 
// Importing input/output java libraries
import java.io.*;
 
public class GFG {
 
    // Main driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        /* Constructing String Builder to
        append the string into the html */
        StringBuilder html = new StringBuilder();
 
        // Reading html file on local directory
        FileReader fr = new FileReader(
            "C:\\Users\\rohit\\OneDrive\\Desktop\\article.html");
 
        // Try block to check exceptions
        try {
 
            // Initialization of the buffered Reader to get
            // the String append to the String Builder
            BufferedReader br = new BufferedReader(fr);
 
            String val;
 
            // Reading the String till we get the null
            // string and appending to the string
            while ((val = br.readLine()) != null) {
                html.append(val);
            }
 
            // AtLast converting into the string
            String result = html.toString();
            System.out.println(result);
 
            // Closing the file after all the completion of
            // Extracting
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            /* Exception of not finding the location and
            string reading termination the function
            br.close(); */
            System.out.println(ex.getMessage());
        }
    }
}


Java
// Java Program to Extract Content from a HTML document
 
// Importing hava generic class
import java.io.*;
import java.util.*;
// Importing java URL class
import java.net.URL;
 
public class GFG {
 
    // Man driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        // Try block to check exceptions
        try {
            String val;
 
            // Constructing the URL connection
            // by defining the URL constructors
            URL URL = new URL(
                "file:///C:/Users/rohit/OneDrive/Desktop/article.html");
 
            // Reading the HTML content from the .HTML File
            BufferedReader br = new BufferedReader(
                new InputStreamReader(URL.openStream()));
 
            /* Catching the string and  if found any null
             character break the String */
            while ((val = br.readLine()) != null) {
                System.out.println(val);
            }
 
            // Closing the file
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            // No file found
            System.out.println(ex.getMessage());
        }
    }
}


输出:

方法 2:使用 Url.openStream()

  • 调用url.openStream() 启动新 TCP 的函数 连接到 URL 提供它的服务器。
  • 现在,在服务器将包含信息的 HTTP 响应发送回连接后,HTTP 获取请求被发送到连接。
  • 该信息采用字节的形式,然后使用InputStreamReader()openStream()方法读取该信息将数据返回给程序。
BufferedReader br = new BufferedReader(new InputStreamReader(URL.openStream()));
  • 首先,我们使用 开放流() 来获取信息。如果连接一切正常(表示显示 200),则该信息以字节的形式包含在 URL 中,然后 HTTP 请求到 网址 获取内容。
  • 然后使用字节以字节的形式收集信息 输入流读取器()
  • 现在运行循环以打印信息,因为需求是在控制台中打印信息。
while ((val = br.readLine()) != null)   // condition
 {    
   System.out.println(val);             // execution if condition is true
  }

执行:

Java

// Java Program to Extract Content from a HTML document
 
// Importing hava generic class
import java.io.*;
import java.util.*;
// Importing java URL class
import java.net.URL;
 
public class GFG {
 
    // Man driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        // Try block to check exceptions
        try {
            String val;
 
            // Constructing the URL connection
            // by defining the URL constructors
            URL URL = new URL(
                "file:///C:/Users/rohit/OneDrive/Desktop/article.html");
 
            // Reading the HTML content from the .HTML File
            BufferedReader br = new BufferedReader(
                new InputStreamReader(URL.openStream()));
 
            /* Catching the string and  if found any null
             character break the String */
            while ((val = br.readLine()) != null) {
                System.out.println(val);
            }
 
            // Closing the file
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            // No file found
            System.out.println(ex.getMessage());
        }
    }
}

输出: