How do I get the head section of a website using the curl and grep commands?

Question 1

It's a bit more complex than what you are trying to do.

First, there are a few syntax problems in your command: the pipe is written as a colon, and the closing quote after <head> is missing. This curl www.hackthissite.org: grep "<head> > ~/data/public/myfirstname\ mylastname/head.txt should be:

curl www.hackthissite.org | grep "<head>" > ~/data/public/myfirstname\ mylastname/head.txt

But even then it won't do what you want, because you are only grepping for the opening head tag, not for what lies between it and the closing tag.
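To see why, here is a minimal sketch run against a local file instead of the real site. The sample page and the variable names are made up for illustration, and it assumes the <head> tag sits on a line of its own:

```shell
# Hypothetical sample page standing in for the output of curl.
sample=$(mktemp)
printf '%s\n' '<html>' '<head>' '<title>Demo</title>' '</head>' '<body></body>' '</html>' > "$sample"

# grep only prints the lines that match the pattern,
# so the title and the closing tag are not included.
only_open=$(grep "<head>" "$sample")
echo "$only_open"   # prints <head>

rm "$sample"
```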

I came up with this:

curl www.hackthissite.org > TEMPORARYFILE.txt; grep -A $(($(grep -n "</head>" TEMPORARYFILE.txt | cut -d: -f1) - $(grep -n "<head>" TEMPORARYFILE.txt | cut -d: -f1))) "<head>" TEMPORARYFILE.txt > ~/data/public/myfirstname\ mylastname/head.txt; rm TEMPORARYFILE.txt

So, piece by piece:

grep -n "</head>" TEMPORARYFILE.txt | cut -d: -f1

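This gets the number of the line where the closing tag is: grep -n prefixes each matching line with its line number and a colon, and cut -d: -f1 keeps only the number. The same goes for grep -n "<head>" TEMPORARYFILE.txt | cut -d: -f1, but for the opening tag.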
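A small sketch of that step on a made-up page (the file contents and the variable names open_line and close_line are assumptions for illustration, with each tag on its own line):

```shell
# Hypothetical sample page; <head> is on line 2 and </head> on line 5.
sample=$(mktemp)
printf '%s\n' '<html>' '<head>' '<title>Demo</title>' '<meta charset="utf-8">' '</head>' '<body></body>' '</html>' > "$sample"

# grep -n prefixes the match with "line-number:".
grep -n "</head>" "$sample"   # prints 5:</head>

# cut -d: -f1 keeps only the part before the first colon.
close_line=$(grep -n "</head>" "$sample" | cut -d: -f1)
open_line=$(grep -n "<head>" "$sample" | cut -d: -f1)
echo "open=$open_line close=$close_line"   # prints open=2 close=5

rm "$sample"
```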

Then we have $(($(grep -n "</head>" TEMPORARYFILE.txt | cut -d: -f1) - $(grep -n "<head>" TEMPORARYFILE.txt | cut -d: -f1))), which should calculate how many lines there are from the opening tag to the closing tag by subtracting the first line number from the second.

This is used with grep's -A option, which controls how many lines after the match are printed. So it will look for the opening head tag and print that line plus every line up to and including the closing tag.
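Putting the pieces together, the same idea can be sketched with named variables on a local file, so it can be tried without hitting the site. The sample page and all variable names are hypothetical. It assumes a single <head> with no attributes and both tags on their own lines; a page that does not look like that would need a different approach. The sed range at the end is not part of the original command, only a cross-check that should print the same lines:

```shell
# Hypothetical sample page standing in for the output of curl.
page=$(mktemp)
printf '%s\n' '<html>' '<head>' '<title>Demo</title>' '<meta charset="utf-8">' '</head>' '<body>' '<p>Hi</p>' '</body>' '</html>' > "$page"

# Line numbers of the opening and closing tags.
open_line=$(grep -n "<head>" "$page" | cut -d: -f1)
close_line=$(grep -n "</head>" "$page" | cut -d: -f1)

# Number of lines to print after the opening tag.
span=$((close_line - open_line))

# -A prints the matching line plus the next $span lines.
head_section=$(grep -A "$span" "<head>" "$page")
echo "$head_section"

# Cross-check (swapped-in technique): print the range between the two patterns.
head_sed=$(sed -n '/<head>/,/<\/head>/p' "$page")

rm "$page"
```

With the real site, the printf line would be replaced by curl www.hackthissite.org > "$page", and the echo would be redirected to the target file.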

Answer