How to find the common paths from a list of paths/files

Prelude:

Given sorted input consisting of a list of paths/files, how can the common paths be found?

To put it in technical terms: when sorted input arrives on standard input, how do I select the shortest proper prefix from it?

Here "prefix" has its normal meaning, e.g. the string "abcde" has the prefix "abc". Here is my sample input:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2'
/home/dave
/home/dave/file1
/home/dave/sub2/file2

This is an example of removing consecutive proper prefixes from standard input, using the following sed command:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2' | sed "N; /^\(.*\)\n\1\//D; P; D" 
/home/dave/file1
/home/dave/sub2/file2
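
For readers less familiar with sed, the effect of that command can be sketched in Python roughly as follows. This is only an illustration, assuming the whole input fits in memory; the function name drop_covered_parents is made up for this example and is not part of any tool.

```python
def drop_covered_parents(lines):
    """Drop each line that is a directory prefix of the line right after it."""
    kept = []
    for i, current in enumerate(lines):
        following = lines[i + 1] if i + 1 < len(lines) else None
        if following is not None and following.startswith(current + "/"):
            continue  # the next line lives under this one, so skip it
        kept.append(current)
    return kept


sample = ["/home/dave", "/home/dave/file1", "/home/dave/sub2/file2"]
print("\n".join(drop_covered_parents(sample)))
```

Like the sed command, it only compares neighbouring lines, which is why the input has to be sorted.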

Question:

My question is how to keep the proper prefix instead and remove all the lines that have that prefix. Since both /home/dave/file1 and /home/dave/sub2/file2 have the prefix /home/dave, /home/dave is kept while the other two are not. I.e., it should do the exact opposite of the sed command above.

More info:

  • The input would already be sorted
  • If I have /home/dave /home/dave/file1 /home/phil /home/phil/file2 (echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file2'), I would expect /home/dave and /home/phil to be the answer.

Application:

I have two disk volumes containing similar content. I want to copy what's in v1 but missing from v2 into another disk volume, v3. Using find, sort, and comm, I am able to get a list of what to copy, but I need to further clean up that list. I.e., as long as I have /home/dave in the list, I don't need the other two.

Thanks!

Answer 1

This answer uses Python. Since the OP wanted to remove the directories covered by their parents, which I had seen as a possibility, I began writing a different program to remove coverings:

Example:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file1' | removecoverings 
/home/phil
/home/dave

Code of the removecoverings command:

#!/usr/bin/env python3

import sys

def list_startswith(a, b):
    # True if list a begins with all the elements of list b.
    if not len(a) >= len(b):
        return False
    return all(x == y for x, y in zip(a[:len(b)], b))

def removecoverings(it):
    g = list(it)
    # Deepest paths first, so that pop() returns the shallowest one left.
    g.sort(key=lambda v: len(v.split('/')), reverse=True)
    o = []
    while g:
        c = g.pop()
        # Collect every remaining path that lies under c ...
        d = []
        for v in g:
            if list_startswith(v.split('/'), c.split('/')):
                d.append(v)
        # ... and drop it, keeping only c itself.
        for v in d:
            g.remove(v)
        o.append(c)
    return o

for o in removecoverings(l.strip() for l in sys.stdin.readlines()):
    print(o)

This answer uses Python. It does a component-wise rather than string-wise common prefix, which is better for paths: the common prefix of /ex/ample and /exa/mple should be /, not /ex. It assumes that what is wanted is the greatest common prefix and not a list of prefixes with their coverings removed. If you have /home/dave /home/dave/file1 /home/phil /home/phil/file2 and expect /home/dave /home/phil rather than /home, this is not the answer you are looking for.

Example:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2' | commonprefix 
/home/dave

Code of the commonprefix command:

#!/usr/bin/env python3

import sys

def commonprefix(l):
    # Unlike os.path.commonprefix, this always returns
    # a path prefix, because it compares the paths
    # component by component.
    cp = []
    ls = [p.split('/') for p in l]
    ml = min(len(p) for p in ls)

    for i in range(ml):
        # All paths must agree on component i.
        s = set(p[i] for p in ls)
        if len(s) != 1:
            break

        cp.append(s.pop())

    # Absolute paths that share only the root give [''];
    # report that as '/' rather than an empty string.
    if cp == ['']:
        return '/'
    return '/'.join(cp)

print(commonprefix(l.strip() for l in sys.stdin.readlines()))

Answer 2

Given that the input is sorted, the pseudocode would be:

$seen = a value no line can begin with;
for each current_line:
    if current_line begins exactly as $seen then next
    else { output current_line; $seen = current_line }

Translating into Perl code (Yes Perl, the most beautiful script language of all):

perl -e '
my $l = "\n";
while (<>) {
    if ($_ !~ /^\Q$l/) {
        print;
        chomp;
        $l = $_;
    }
}
'

Credit: Ben Bacarisse @bsb.me.uk, from comp.lang.perl.misc. Thanks Ben, it works great!
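
For comparison, the same single pass can be sketched in Python. This is a hedged illustration rather than a translation of the Perl: the name remove_covered is made up here, and the sketch deliberately differs in two ways. It only treats a line as covered when it equals the kept path or continues it with a "/", whereas the regex match above is string-wise (so /home/dave would also swallow a sibling such as /home/dave2). It also sorts by path components itself, because with a plain byte-wise sort a name like /home/dave-old can land between /home/dave and /home/dave/file1.

```python
def remove_covered(paths):
    """Keep only paths that do not lie under another path in the list."""
    kept = []
    # Sorting by components puts every descendant directly after its ancestor.
    for path in sorted(paths, key=lambda p: p.split("/")):
        if kept:
            parent = kept[-1].rstrip("/")
            if path == kept[-1] or path.startswith(parent + "/"):
                continue  # already covered by the last kept path
        kept.append(path)
    return kept


sample = [
    "/home/dave",
    "/home/dave-old",
    "/home/dave/file1",
    "/home/dave/sub2/file2",
    "/home/phil",
    "/home/phil/file2",
]
print("\n".join(remove_covered(sample)))
```

On this sample the sketch keeps /home/dave, /home/dave-old and /home/phil.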

Answer 3

And here is the one-liner version of xpt's answer, again assuming sorted input:

perl -lne 'BEGIN { $l="\n"; }; if ($_ !~ /^\Q$l/) { print $_; $l = $_; }'

Run on the example input

/home/dave
/home/dave/file1
/home/dave/sub2/file2
/home/phil
/home/phil/file2 

using

echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file2' | perl -lne 'BEGIN { $l="\n"; }; if ($_ !~ /^\Q$l/) { print $_; $l = $_; }'

gives

/home/dave
/home/phil

The magic is in the command-line arguments to perl: -e lets us give the script on the command line, -n iterates over the lines of the input (placing each line in $_), and -l deals with newlines for us (it strips the trailing newline from each input line and adds one back on print).

The script uses $l to track the last prefix seen. The BEGIN block runs before the first line is read and initializes $l to a string that can never start an input line (a lone newline, which -l has already stripped from every line). The conditional then runs on every line (held in $_) and says "if the line does not have the current value of $l as a prefix, then print the line and save it as the value of $l." Because of the command-line arguments, this is essentially identical to the other script.

The catch is that both scripts assume that the common prefix exists as its own line, so they don't find the common prefix for input like

/home/dave/file1
/home/dave/file2
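
For input like this, where no line is itself the shared parent, Python's standard library can compute a component-wise common path directly. A minimal sketch, assuming Python 3.5 or later (where commonpath is available) and POSIX-style paths; note that it returns one common path for the whole list, not one per group of related lines:

```python
import posixpath

paths = ["/home/dave/file1", "/home/dave/file2"]

# commonpath compares whole path components, not raw characters.
common = posixpath.commonpath(paths)
print(common)  # /home/dave
```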
