Find and remove duplicate lines that contain specific punctuation

Answer 1

This assumes that duplicate values are on consecutive lines!

Here is a Perl script that does the job.

It has not been tested on large files!

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'file1'; # path to the input file
# read the whole input file into memory
open my $F, '<', $file or die "unable to open '$file': $!";
my @list = <$F>;
close $F;
chomp @list;
# build [key, original] pairs; the key is the lowercased line with . _ - removed
my @res = map { my $key = lc $_; $key =~ tr/._-//d; [$key, $_] } @list;
# remember the first line (if any) as the previous one
my ($prev_tst, $prev_orig) = @res ? @{ $res[0] } : ('', '');
# loop over the remaining lines
for my $ind (1 .. $#res) {
    my ($tst, $orig) = @{ $res[$ind] };
    # the key is the same as the previous one: the two lines are duplicates
    if ($tst eq $prev_tst) {
        # if the previous original line contains a dot, delete it
        if ($prev_orig =~ tr/.//) {
            undef $res[$ind-1];
        # otherwise, if the current original line contains a dot, delete it
        } elsif ($orig =~ tr/.//) {
            undef $res[$ind];
        }
    }
    # remember the values for the next iteration
    $prev_tst = $tst;
    $prev_orig = $orig;
}
# write the surviving lines to the result file
my $result = 'result_file'; # path to the result file
open my $R, '>', $result or die "unable to open '$result': $!";
for (@res) {
    next unless defined $_; # skip deleted entries
    print $R $_->[1], "\n";
}
close $R or die "unable to close '$result': $!";
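
Since the script above reads the whole file into memory and has not been tried on large inputs, the same rule can also be applied one line at a time. The following is only a rough sketch using awk from the shell, not the original author's method; the function name dedupe_dotted and the sample lines are made up for illustration. It keeps the same assumption that duplicates are consecutive.

```shell
# dedupe_dotted is a hypothetical helper name; it reads stdin and writes stdout.
dedupe_dotted() {
    awk '
    {
        # comparison key: lowercased line with . _ - removed
        key = tolower($0)
        gsub(/[-._]/, "", key)
        keep = 1
        if (NR > 1 && key == prev_key) {
            if (index(prev_line, ".")) {
                prev_keep = 0   # drop the held previous line
            } else if (index($0, ".")) {
                keep = 0        # drop the current line
            }
        }
        # the previous line is printed only after the current line has been seen
        if (NR > 1 && prev_keep) print prev_line
        prev_key = key; prev_line = $0; prev_keep = keep
    }
    END { if (NR > 0 && prev_keep) print prev_line }
    '
}

# Hypothetical sample input.
sample_out=$(printf '%s\n' 'foo.bar' 'foo-bar' 'baz' 'qux_1' 'qux.1' | dedupe_dotted)
printf '%s\n' "$sample_out"
# prints:
# foo-bar
# baz
# qux_1
```

Only one line is held back at any time, so memory use does not grow with the size of the file.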

Answer 2

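Something like

sed 's/\./-/g; s/__*/-/g' /path/to/infile | sort -u > /path/to/outfile

should do the trick. Note that this rewrites every dot and every run of underscores to a hyphen before deduplicating, and that sort -u also reorders the lines.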
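
To see what the pipeline above does, here is a small run on made-up input (the sample lines are an assumption, not taken from the question). LC_ALL=C is added only so that the sort order is predictable.

```shell
# Hypothetical sample input: three spellings of one name, plus an unrelated line.
input='foo.bar
foo-bar
foo__bar
baz'

# Same sed expression as above, applied to the sample instead of a file.
result=$(printf '%s\n' "$input" | sed 's/\./-/g; s/__*/-/g' | LC_ALL=C sort -u)

printf '%s\n' "$result"
# prints:
# baz
# foo-bar
```

All three spellings collapse to foo-bar, so the surviving line is the normalized form and not necessarily one of the original lines.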
