Extracting links from a numeric range of web pages

Answer 1

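If you want to do this in code, you can do it in Perl with the LWP::Simple or Mechanize (WWW::Mechanize) modules.

The following may have what you are after: finding all links on a web page using the LWP::Simple module.

This assumes you are comfortable with a command-line solution in Perl. It works the same way on Windows and Linux. It would not take much modification to accept the URLs to parse as command-line arguments.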
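As a rough illustration of the Perl approach, the sketch below wraps a Perl one-liner in a shell function. It covers only the link-matching step: it reads HTML on stdin and prints each double-quoted `href` value, using a simple regular expression rather than a real HTML parser. The function name `extract_hrefs` is made up for this example, and fetching the page (for instance with LWP::Simple's `get`) is deliberately left out so the sketch runs without network access.

```shell
#!/bin/bash
# Hypothetical helper: print every double-quoted href value found in HTML on stdin.
# A regex is only an approximation; a real HTML parser copes better with odd markup.
extract_hrefs() {
    perl -ne 'print "$1\n" while /href\s*=\s*"([^"]*)"/gi'
}

# Example with inline HTML instead of a fetched page.
printf '%s\n' '<a href="http://example.com/a.html">A</a> <a href="/b.html">B</a>' | extract_hrefs
```

Relative links such as `/b.html` are printed as they appear in the page; resolving them against the page URL would be a separate step.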

Answer 2

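Yes, here is a nice bash script. It uses the lynx browser to extract the URLs from each page and prints them, so you can redirect the output to a text file: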

#!/bin/bash
#
# Usage:
#
#   linkextract <start> <stop> <pad> <url>
#
#   <start> is the first number in the filename range. Must be an integer.
#   <stop> is the last number in the filename range. Must be an integer.
#   <pad> is the number of digits the number in the filename is zero-padded to.
#   <url> is the URL. Insert "<num>" where you want the number to appear. You'll
#         need to enclose the entire argument in quotes.

for (( i=${1} ; i<=${2} ; i++ )); do
    # Zero-pad the counter to <pad> digits.
    num=$(printf "%0${3}d" "${i}")
    # Substitute the padded number for the <num> placeholder in the URL.
    url=${4/<num>/${num}}
    # lynx prints the page's links as a numbered list; sed strips the numbering.
    lynx -dump -listonly "${url}" | sed -r -n "/^ +[0-9]/s/^ +[0-9]+\. //p"
done
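The `sed` filter is probably the least obvious part of the script. `lynx -dump -listonly` prints a numbered reference list, and the filter keeps only the numbered lines and strips the numbering. The sketch below applies the same expression to a hand-written sample, so neither lynx nor a network connection is needed. The sample is an assumption about what the lynx output looks like (the exact layout may vary between versions), and `strip_numbers` is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: keep only lines of the form "   12. URL" and drop the "12. " prefix.
strip_numbers() {
    sed -E -n '/^ +[0-9]/s/^ +[0-9]+\. //p'
}

# Assumed shape of "lynx -dump -listonly" output: a heading, a blank line,
# then one numbered link per line.
sample='References

   1. http://example.com/page001.html
   2. http://example.com/about.html'

printf '%s\n' "$sample" | strip_numbers
```

Lines that do not start with spaces followed by a digit, such as the heading, are dropped because of `-n`.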

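You need to have the lynx browser installed; on Debian it is available as the "lynx" package. The script prints the extracted URLs to stdout. So, for the example in your question, you would run the following (assuming you saved the script to a file named linkextract):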

$ linkextract 1 329 3 "http://example.com/page<num>.html"
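To see which pages a command like this would visit, the URL-building step can be tried on its own. This is a minimal sketch that assumes the `<pad>` argument is meant to set the zero-padding width, as the usage comment describes. `expand_urls` is a hypothetical name, and nothing is fetched.

```shell
#!/bin/bash
# Hypothetical helper: expand_urls <start> <stop> <pad> <url-template>
# Prints one URL per number in the range, with <num> replaced by the padded number.
expand_urls() {
    local start=$1 stop=$2 pad=$3 template=$4
    local i num
    for (( i = start; i <= stop; i++ )); do
        # "%0*d" takes the field width from the preceding argument.
        printf -v num '%0*d' "$pad" "$i"
        printf '%s\n' "${template/<num>/$num}"
    done
}

expand_urls 1 3 3 "http://example.com/page<num>.html"
```

With a pad of 3, the first URL printed is http://example.com/page001.html. Checking the list this way before running the real script avoids sending lynx to pages that do not exist.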

Answer 3

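You can use the Site Visualizer crawler for this job. Download and install it, then click New Project, enter the URL of your website, click OK, and then click the Start Crawl toolbar button.

After the crawl completes, double-click the All Links report on the Reports tab. You will get every link present on the website along with other information: source/target link URLs, content type (HTML, image, PDF, CSS, etc.), response, and so on. Select the whole table (context menu or the Ctrl+A shortcut), then click the Copy Rows With Headers context menu item. After that, you can paste the data into an Excel sheet or a plain text document.

[Screenshot: Extract all website links]

The program has a 30-day trial period, but it is fully featured, so you can use it free for one month.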
