Filtering RSS items with sed

Question

As suggested in the comments, I tried using xmlstarlet to solve this, and it worked well. Here is my script:

xml ed -d '//item[not(contains(title,"Project Foo"))]' < sample_rss.xml

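We assume the feed content is in the file sample_rss.xml. That content is fed into xml ed -d, which deletes every node matching the given XPath expression. The XPath expression selects every <item> node whose <title> child does not contain the text "Project Foo", so only the matching items remain in the output.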
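To make the behaviour of the filter concrete, here is a minimal sketch run against a made-up three-item feed. The feed contents, the temporary file, and the variable names are hypothetical, and the sketch assumes the binary is installed as either xmlstarlet or xml (the name varies between installations).

```shell
# Locate the binary; some systems install it as "xmlstarlet", others as "xml".
XML=$(command -v xmlstarlet || command -v xml)

# Hypothetical sample feed, written to a temporary file for illustration only.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Project Foo: release 1.0</title></item>
    <item><title>Project Bar: release 2.0</title></item>
    <item><title>News about Project Foo</title></item>
  </channel>
</rss>
EOF

# Delete every <item> whose <title> does not contain "Project Foo".
filtered=$("$XML" ed -d '//item[not(contains(title,"Project Foo"))]' "$feed")
printf '%s\n' "$filtered"

# Count the <item> elements that survived the filter.
kept=$(printf '%s\n' "$filtered" | "$XML" sel -t -v 'count(//item)')
echo "items kept: $kept"

rm -f "$feed"
```

Note that contains() is a case-sensitive substring match, so a title such as "project foo" would be deleted as well.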

This seems to work well, and I am happy with the execution time too:

real    0m0.003s
user    0m0.001s
sys     0m0.002s

Beware of namespaces

If you want to apply this to a real RSS or Atom feed, you may notice that the root feed element carries XML namespace (xmlns) attributes, as in this example from YouTube:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:yt="http://www.youtube.com/xml/schemas/2015" xmlns:media="http://search.yahoo.com/mrss/" xmlns="http://www.w3.org/2005/Atom">
   ...
</feed>

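In that case the script above no longer works, because the unprefixed names in the XPath expression do not match elements that live in a default namespace. Fixing this gave me quite a headache, but here is how to make it work: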

xml ed -d '//_:entry[not(contains(_:title,"Project Foo"))]' < youtube_rss.xml

More information about this namespace issue can be found here: http://xmlstar.sourceforge.net/doc/UG/ch05.html
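As a rough illustration of the namespace problem, the sketch below runs the filter on a made-up Atom feed twice: once with unprefixed names and once with a prefix bound explicitly through the -N option, which is an alternative to the _: shortcut used above. The feed contents, the prefix name a, and the variable names are hypothetical, and the binary is assumed to be installed as xmlstarlet or xml.

```shell
# Locate the binary; some systems install it as "xmlstarlet", others as "xml".
XML=$(command -v xmlstarlet || command -v xml)
ATOM=http://www.w3.org/2005/Atom

# Hypothetical Atom feed with a default namespace, for illustration only.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample channel</title>
  <entry><title>Project Foo: episode 1</title></entry>
  <entry><title>Something else</title></entry>
</feed>
EOF

# Unprefixed names match no namespaced element, so nothing is deleted.
unprefixed=$("$XML" ed -d '//entry[not(contains(title,"Project Foo"))]' "$feed" \
  | "$XML" sel -N a="$ATOM" -t -v 'count(//a:entry)')

# Binding the prefix "a" to the Atom namespace lets the same filter match.
prefixed=$("$XML" ed -N a="$ATOM" \
  -d '//a:entry[not(contains(a:title,"Project Foo"))]' "$feed" \
  | "$XML" sel -N a="$ATOM" -t -v 'count(//a:entry)')

echo "entries left without prefix: $unprefixed"
echo "entries left with prefix:    $prefixed"

rm -f "$feed"
```

The explicit -N binding is more verbose than _: but does not depend on which namespace the root element declares as its default.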

Answer 1