Finding duplicate article titles in a .bib file

Answer 1

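With perl you can go through the bib file, store every title as a hash key and the line numbers it appears on as the hash value, then loop over the hash and print each title whose value holds more than one entry. To do so, create a file (e.g. "finddupls.pl") with the following content, change the bib file name, and run perl finddupls.pl in a terminal.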

#!/usr/bin/env perl
use strict;
use warnings;

my %seen;

my $line = 0;
open my $B, '<', 'file.bib' or die "Cannot open file.bib: $!";
while (<$B>) {
    $line++;
    # only consider lines which start with a title field
    next unless /^\s*title\s*=\s*(.+)$/i;
    # lower-case everything to be case-insensitive
    my $title = lc $1;
    # remove all non-alphanumeric characters, because bibtex could have " or { to encapsulate strings etc
    $title =~ s/[^a-z0-9 _-]//g;
    # append this line number to the list stored for the title
    $seen{$title} .= "$line,";
}
close $B;

# loop through the titles and count the number of lines found
foreach my $title (keys %seen) {
    # count number of elements separated by comma
    my $num = $seen{$title} =~ tr/,//;
    print "title '$title' found $num times, lines: $seen{$title}\n" if $num > 1;
}

# write sorted list into file
open my $S, '>', 'sorted_titles.txt' or die "Cannot write sorted_titles.txt: $!";
print $S join("\n", sort keys %seen), "\n";
close $S;

This prints something like the following directly to the terminal:

title 'observation on soil moisture of irrigation cropland by cosmic-ray probe' found 2 times, lines: 99,1350,
title 'multiscale and multivariate evaluation of water fluxes and states over european river basins' found 2 times, lines: 199,1820,
title 'calibration of a non-invasive cosmic-ray probe for wide area snow water equivalent measurement' found 2 times, lines: 5,32,

In addition, a file sorted_titles.txt is created that lists all titles in alphabetical order, so you can go through it manually to spot duplicates.
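
For readers who would rather not create a script file, the same idea (normalize each title, remember the line numbers, report titles seen more than once) can be sketched with awk from the shell. This is only a sketch, not the method described above: the function name find_dup_titles and the file sample.bib are made up for illustration, and it assumes each title field sits on a single line of the form title = ...

```shell
# Hypothetical helper: report normalized titles that occur on more than one line.
find_dup_titles() {
  awk '
    # consider only lines that start with a title field (case-insensitive)
    tolower($0) ~ /^[ \t]*title[ \t]*=/ {
      t = tolower($0)
      sub(/^[ \t]*title[ \t]*=[ \t]*/, "", t)   # drop the field name
      gsub(/[^a-z0-9 _-]/, "", t)               # drop braces, quotes, commas
      lines[t] = lines[t] NR ","                # remember the line number
      count[t]++
    }
    END {
      for (t in count)
        if (count[t] > 1)
          printf "title \047%s\047 found %d times, lines: %s\n", t, count[t], lines[t]
    }
  ' "$1"
}

# Made-up sample data, only for illustration.
cat > sample.bib <<'EOF'
@article{a1,
  title = {Cosmic-Ray Probe Calibration},
  year = {2015}
}
@article{a2,
  title = "Cosmic-ray probe calibration",
  year = {2016}
}
EOF

find_dup_titles sample.bib
```

On this sample the two entries differ in capitalization and in their delimiters, yet both normalize to the same key, so the sketch reports lines 2 and 6 as one duplicated title.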

Answer 2

If you can trust the title fields to be identical, it is very simple:

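grep -n 'title =' bibliography.bib | sort -s -k 2 | uniq -cdf 1

This outputs only the non-unique lines (-d) of the file bibliography.bib, together with their number of occurrences (-c) and their line number in the bibliography file (-n). The sort step is needed because uniq only compares adjacent lines: sort -k 2 places identical title lines next to each other, and -s keeps them in file order. -f 1 tells uniq to ignore the first field, which is this line number.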

So if you see a line like:

     2 733:  title =    {Ethica Nicomachea},

you know that title = {Ethica Nicomachea}, appears twice and that its first occurrence is on line 733 of the .bib file.
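
To try the pipeline on a small file, a sketch along the following lines may help. The function name find_identical_titles and the file demo.bib are hypothetical. Two details are assumptions of this sketch rather than part of the answer: the pattern is anchored so that booktitle = lines are not picked up, and the lines are sorted by their title text before uniq sees them, because uniq only compares adjacent lines. The -s (stable) option is assumed to be available, as it is in GNU and BSD sort.

```shell
# Hypothetical helper: list title lines that occur more than once, verbatim.
find_identical_titles() {
  grep -n -E '^[[:space:]]*title[[:space:]]*=' "$1" |
    sort -s -k 2 |        # group identical title lines, keeping file order
    uniq -c -d -f 1       # count duplicates, ignoring the line-number field
}

# Made-up sample data, only for illustration.
cat > demo.bib <<'EOF'
@book{a,
  title = {Ethica Nicomachea},
}
@book{b,
  title = {Politica},
}
@book{c,
  title = {Ethica Nicomachea},
}
EOF

find_identical_titles demo.bib
```

On this sample a single line is printed: a count of 2 followed by 2:  title = {Ethica Nicomachea}, which means the title occurs twice and first appears on line 2. The width of the padding in front of the count differs between uniq implementations.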
