📜  How to Extract YouTube Data in Python?

📅  Last modified: 2022-05-13 01:54:38.131000             🧑  Author: Mango

Prerequisite: BeautifulSoup

The YouTube statistics of a YouTube channel can be used for analysis, and they can be extracted with Python code. A lot of data can be retrieved, such as viewCount, subscriberCount, and videoCount. This article discusses two ways this can be done.

Method 1: Using the YouTube API

First, we need to generate an API key. You need a Google account to access the Google API Console, request an API key, and register your application. You can do this from the Google API page.
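
If you would rather not paste the key directly into your script, it can be read from an environment variable instead. A minimal sketch, assuming the variable is named YT_API_KEY (the name is our choice, not part of the API):

Python3
import os

# read the key from an environment variable (YT_API_KEY is a name chosen
# for this sketch; set it in your shell before running the script)
API_KEY = os.environ.get("YT_API_KEY")
if API_KEY is None:
    raise RuntimeError("Please set the YT_API_KEY environment variable")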

To extract the data, we need the channel ID of the YouTube channel whose statistics we want to look at. To get the channel ID, visit that particular YouTube channel and copy the last part of the URL (in the example given below, the channel ID of the GeeksForGeeks channel is used).
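
For instance, for the URL https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ the channel ID is the last path segment. A small sketch of pulling it out of the URL string (not part of the article's code):

Python3
# take the last path segment of a channel URL of the form
# https://www.youtube.com/channel/<channel id>
channel_url = "https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ"
channel_id = channel_url.rstrip("/").rsplit("/", 1)[-1]
print(channel_id)  # UC0RhatS1pyxInC00YKjjBqQ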

Approach

  • First, create youtube_statistics.py.
  • In this file, use the YTstats class to extract the data and generate a JSON file containing everything that was extracted.
  • Now create main.py.
  • In main.py, import YTstats from youtube_statistics.py.
  • Add the API key and the channel ID.
  • Now, using the first file, the data corresponding to the given API key and channel ID will be retrieved and saved to a JSON file.

Example:

Code for the main.py file:

Python3
from youtube_statistics import YTstats
  
# paste the API key generated by you here
API_KEY = "AIzaSyA-0KfpLK04NpQN1XghxhSlzG-WkC3DHLs"
  
# paste the channel id here
channel_id = "UC0RhatS1pyxInC00YKjjBqQ"
  
yt = YTstats(API_KEY, channel_id)
yt.get_channel_statistics()
yt.dump()


Code for the youtube_statistics.py file:

Python3

import requests
import json


class YTstats:

    def __init__(self, api_key, channel_id):
        self.api_key = api_key
        self.channel_id = channel_id
        self.channel_statistics = None

    def get_channel_statistics(self):
        # query the YouTube Data API v3 "channels" endpoint for the statistics part
        url = f'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={self.channel_id}&key={self.api_key}'

        json_url = requests.get(url)
        data = json.loads(json_url.text)

        try:
            data = data["items"][0]["statistics"]
        except (KeyError, IndexError):
            # the response did not contain the expected fields
            # (e.g. wrong channel id or invalid API key)
            data = None

        self.channel_statistics = data
        return data

    def dump(self):
        # do nothing if get_channel_statistics() has not been called or failed
        if self.channel_statistics is None:
            return

        channel_title = "GeeksForGeeks"
        channel_title = channel_title.replace(" ", "_").lower()

        # generate a json file with all the statistics data of the youtube channel
        file_name = channel_title + '.json'
        with open(file_name, 'w') as f:
            json.dump(self.channel_statistics, f, indent=4)
        print('file dumped')
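
Note that dump() hard-codes the channel title as "GeeksForGeeks". If you want the file name to follow the channel automatically, the same channels endpoint can also return the title when queried with part=snippet. A sketch of such a helper (an extension, not part of the article's code):

Python3
import requests


def get_channel_title(api_key, channel_id):
    # the "snippet" part of the channels endpoint includes the channel title
    url = ('https://www.googleapis.com/youtube/v3/channels'
           f'?part=snippet&id={channel_id}&key={api_key}')
    data = requests.get(url).json()
    try:
        return data['items'][0]['snippet']['title']
    except (KeyError, IndexError):
        return None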

Output:
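
Running main.py writes the statistics to geeksforgeeks.json. The file holds the statistics object returned by the API; its general shape is shown below (the values are placeholders, not the channel's real figures):

{
    "viewCount": "123456789",
    "subscriberCount": "1000000",
    "hiddenSubscriberCount": false,
    "videoCount": "10000"
}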

Method 2: Using BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this method, we will use BeautifulSoup together with Selenium to scrape data from a YouTube channel. The program reports each video's view count, upload time, title, and URL, and prints them using Python string formatting.

Approach

  • Import the modules.
  • Provide the URL of the channel whose data you want to fetch.
  • Extract the data.
  • Display the fetched data.

Example:

Python3

# import required packages
from selenium import webdriver
from bs4 import BeautifulSoup

# provide the url of the channel whose data you want to fetch
urls = [
    'https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ'
]


def main():
    driver = webdriver.Chrome()
    for url in urls:
        # open the channel's "Videos" tab, sorted by popularity
        driver.get('{}/videos?view=0&sort=p&flow=grid'.format(url))
        content = driver.page_source.encode('utf-8').strip()
        soup = BeautifulSoup(content, 'lxml')
        titles = soup.find_all('a', id='video-title')
        # each video's grid renderer contains two spans:
        # the view count followed by the upload time
        views = soup.find_all(
            'span', class_='style-scope ytd-grid-video-renderer')
        video_urls = soup.find_all('a', id='video-title')
        print('Channel: {}'.format(url))
        i = 0  # index into views (view count and upload time)
        j = 0  # index into video_urls
        for title in titles[:10]:
            print('\n{}\t{}\t{}\thttps://www.youtube.com{}'.format(
                title.text, views[i].text, views[i + 1].text,
                video_urls[j].get('href')))
            i += 2
            j += 1
    driver.quit()


main()
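
A couple of practical notes: webdriver.Chrome() assumes Chrome and a matching ChromeDriver are available on the machine (recent Selenium releases can fetch the driver automatically), and by default it opens a visible browser window. If you prefer to run without a window, headless mode can be enabled; a minimal sketch (whether YouTube serves the same markup to a headless browser is not guaranteed):

Python3
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# start Chrome without opening a visible browser window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)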

Output
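
For each of the channel's ten most popular videos, the script prints the video title, the view count, the upload time, and the full video URL, separated by tabs.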