/media/bill/PROJECTS/System_maintenance/Linux/lists compare - diff & comm.txt
www.BillHowell.ca 03Dec2018 initial
see also "/media/bill/SWAPPER/System_maintenance/Linux/comm - compare files line by line notes.txt"

*********** comm
Compare sorted files FILE1 and FILE2 line by line.
When FILE1 or FILE2 (not both) is -, read standard input.
With no options, produce three-column output. Column one contains lines unique to FILE1,
column two contains lines unique to FILE2, and column three contains lines common to both files.
   -1     suppress column 1 (lines unique to FILE1)
   -2     suppress column 2 (lines unique to FILE2)
   -3     suppress column 3 (lines that appear in both files)
[file1, file2] must be SORTED!!

To find lines of file1 not in file2 :
   diff file2 file1 | grep '^>' | sed 's#^>\ ##'
To find lines common to both file1 & file2 :
   comm -12 file1 file2

********************************************
*************** 04Dec2018 Example - key_emails for INNS mass email list
   /media/bill/ramdisk/key_emails notin Target diff.txt
This is more complex. Easiest to use the "NN Projects, Assciation" sheet of "_contacts PROJECTS.ods" :
1. Copy-paste organisational people to "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/2_key_emails non-attendee full contacts.txt"
2. Extract "non-attendee" emails using emailFolder_extract_addresses in "email - extract, sort, cull addresses from text.ndf" :
   emailFolder_extract_addresses '/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/2_key_emails non-attendee full contacts.txt' '/media/bill/ramdisk/3_key_emails non-attendee emails.txt'
3. $ sort -u "/media/bill/ramdisk/3_key_emails non-attendee emails.txt" >"/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/3_key_emails non-attendee emails.txt"
4.
Create a list of the remaining key_emails to add to Targets, "INNS mass email list.ods" :
   diff <(sort "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/3_key_emails non-attendee emails.txt") <(sort "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/0_key_emails_list.txt") | grep '^>' | sed 's/^>\ //' >"/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/4_key_emails add to Targets.txt"
   >> very quick (<1 second?), looks correct for IJCNN2018 Organising Committee!
5. Add the remaining key_emails to the Target sheet in :
   "/media/bill/Midas/a_INNS Lexicom email server/INNS mass email list.ods"

*************** 03Dec2018 How do I efficiently add key_emails to Targets, and remove them from Undeliverables?
1. Generate a list by copying the "email" column of ALL Undeliverables; save to "/media/bill/ramdisk/email all targets.txt"
2. Use bash to generate the emails of concern (see stackoverflow.com below). Files must be sorted!! :
   https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another
   diff <(sort "/media/bill/ramdisk/email all targets.txt") <(sort "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/0_key_emails_list.txt") | grep '^>' | sed 's/^>\ //' >"/media/bill/ramdisk/key_emails notin Target diff.txt"
   >> very quick (<1 second?), looks correct for IJCNN2018 Organising Committee!

*************** 03Dec2018 How do I efficiently remove key_emails from Undeliverables?
1. Generate a list by copying the "email" column of ALL Undeliverables; save to "/media/bill/ramdisk/email all undeliverables.txt"
2.
Use bash to generate the emails of concern (fixed-string, whole-line grep, patterns taken from the undeliverables file) :
   grep -Fxf "/media/bill/ramdisk/email all undeliverables.txt" "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/0_key_emails_list.txt" >"/media/bill/ramdisk/key_emails in Undeliverables grep.txt"
   https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another
   #find lines common to both files
   comm -12 file1 file2
   comm -12 <(sort "/media/bill/ramdisk/email all undeliverables.txt") <(sort "/media/bill/PROJECTS/a_INNS Lexicom email server/key emails extraction/0_key_emails_list.txt") >"/media/bill/ramdisk/key_emails in Undeliverables grep.txt"
   >> very fast

**************************
https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another
Fast way of finding lines in one file that are not in another?

I have two large files (sets of filenames), roughly 30,000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
   line1
   line2
   line3
And this is file2:
   line1
   line4
   line5
Then my result/output should be:
   line2
   line3
This works:
   grep -v -f file2 file1
But it is very, very slow when used on my large files. I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that. Can anyone help me find a fast way of doing this, using bash and basic linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff:
   diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
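A quick sanity check of the comm patterns used in these notes, plus a fast grep alternative for unsorted files (-F fixed strings, -x whole-line match, -f read patterns from a file); the /tmp file names here are throwaway examples:

```shell
# Two small sample files, matching the question's example (must be sorted for comm)
printf 'line1\nline2\nline3\n' > /tmp/file1
printf 'line1\nline4\nline5\n' > /tmp/file2

# Lines of file1 not in file2 (suppress columns 2 and 3)
comm -23 /tmp/file1 /tmp/file2
# -> line2, line3

# Lines common to both files
comm -12 /tmp/file1 /tmp/file2
# -> line1

# Same "not in file2" result without sorting: fixed-string, whole-line,
# inverted match against the patterns in file2
grep -v -x -F -f /tmp/file2 /tmp/file1
# -> line2, line3
```

The -F and -x flags avoid the partial-line regex matching that makes the plain "grep -v -f" approach crawl on large files.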
*************** Accepted answer (mr.spuratic, Aug 13 '13) :
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
   diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):
   diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -v etc.) for less strict matching.

Explanation :
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.
If you are familiar with unified diff format, you can partly recreate it with:
   diff --old-line-format="-%L" --unchanged-line-format=" %L" \
        --new-line-format="+%L" file1 file2
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the --- +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things, like numbering each line with %dn.
The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place.
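A minimal check of the empty-format trick, using the question's example files (throwaway /tmp names):

```shell
printf 'line1\nline2\nline3\n' > /tmp/file1
printf 'line1\nline4\nline5\n' > /tmp/file2

# Suppress "new" and "unchanged" lines; only lines removed from file1 remain.
# Note: diff exits non-zero when the files differ, which is expected here.
diff --new-line-format="" --unchanged-line-format="" /tmp/file1 /tmp/file2
# -> line2, line3
```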
Here's a simple awk (nawk) script (inspired by the scripts linked-to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.
   # output lines in file1 that are not in file2
   BEGIN { FS="" }                       # preserve whitespace
   (NR==FNR) { ll1[FNR]=$0; nl1=FNR; }   # file1, index by lineno
   (NR!=FNR) { ss2[$0]++; }              # file2, index by string
   END {
       for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
   }
This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
   BEGIN { FS="" }
   (NR==FNR) {   # file1, index by lineno and string
       ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
   }
   (NR!=FNR) {   # file2
       if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
   }
   END {
       for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
   }
The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end, the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension): repeated runs with chunks of file1, reading file2 completely each time:
   split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of - (meaning stdin) on the gawk command line.
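The first awk script above can be saved and run as follows (the /tmp file names are illustrative only); note that the inputs need not be sorted, and file1's original order is preserved:

```shell
# Save the unordered-input script from above as linesnotin.awk
cat > /tmp/linesnotin.awk <<'EOF'
BEGIN { FS="" }                       # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }   # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }              # file2, index by string
END { for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll] }
EOF

printf 'line3\nline1\nline2\n' > /tmp/file1   # deliberately unsorted
printf 'line1\nline4\n'        > /tmp/file2
awk -f /tmp/linesnotin.awk /tmp/file1 /tmp/file2
# -> line3, line2  (file1 order kept)
```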
This is provided by split from file1, in chunks of 20000 lines per invocation.
For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

*************** Second answer (JnBrymn, Oct 28 '14) :
The comm command (short for "common") may be useful:
   comm - compare two sorted files line by line
   #find lines only in file1
   comm -23 file1 file2
   #find lines only in file2
   comm -13 file1 file2
   #find lines common to both files
   comm -12 file1 file2
The man page is actually quite readable for this.
# enddoc