Is there a way to search a file looking for two or more words that appear in the file?

Question

I want to "grep" a file and get a match only if two or more of my keywords are found in it.

For example, I want to search a file for blah and foo, but they are on separate lines of the file. I only want to know whether the file contains both keywords. I realize grep may not be the right tool for this and I may need to write a script of some sort, but grep or some equivalent would be much faster.

More Info:

I have a file that contains:

stuff blah explore
morestuff explore here
foo

I want to be able to search that file to see whether both blah and foo appear in it.

Answer 1

I love Python ... I couldn't resist writing you some code. It would be easy to generalize it to check all the files in a directory ... but I'm trying to hook you!

I am sure it could be done in a more Pythonic way, but Python is great for string stuff -- way more readable than Perl. (Flame war alert!)

#!/usr/bin/env python3
"""USAGE: python two_tags.py data.txt target1 target2
Scans a file for two arbitrary text strings.

    $ python two_tags.py data.txt gosh golly
    Search 6 lines -- gosh NOT FOUND., golly FOUND!

    $ python two_tags.py data.txt golly howdy
    Search 2 lines -- golly FOUND!, howdy FOUND!
"""
# Sample file data.txt for testing the code:
# asdfs fsdf howdy  a ha b
# safd golly gee
# whiz
# zard
# spam
# golly

import sys

input_file = sys.argv[1]  # the filename passed on the command line
target1 = sys.argv[2]
target2 = sys.argv[3]

line_count = 0
target1_found = 'NOT FOUND.'
target2_found = 'NOT FOUND.'
with open(input_file, 'r') as f:
    for line in f:
        line_count += 1
        # A plain substring test replaces the old string.find() call.
        if target1 in line:
            target1_found = 'FOUND!'
        if target2 in line:
            target2_found = 'FOUND!'
        # Stop reading as soon as both targets have been seen.
        if target1_found == 'FOUND!' and target2_found == 'FOUND!':
            break

print("Search %i lines -- %s %s, %s %s\n" % (line_count, target1, target1_found, target2, target2_found))
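The same idea extends to any number of keywords. The following is only a sketch, not part of the original answer: the function names find_keywords and contains_all are made up for illustration, and it assumes plain substring matching (so "foo" also matches "food").

```python
def find_keywords(path, keywords):
    """Return the set of keywords that occur anywhere in the file at path."""
    remaining = set(keywords)
    found = set()
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            # Substring test on each line; keywords may sit on different lines.
            hits = {word for word in remaining if word in line}
            found |= hits
            remaining -= hits
            if not remaining:
                break  # every keyword has been seen, stop reading
    return found


def contains_all(path, keywords):
    """Return True only if every keyword appears somewhere in the file."""
    return find_keywords(path, keywords) == set(keywords)
```

With the sample file from the question, contains_all("file.txt", ["blah", "foo"]) should be true, because both words appear even though they are on different lines.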

Answer 2


perl -e '$a=`cat file.txt`;if (grep(/foo/,$a)&&(grep(/blah/,$a))){print "YES\n"}else{print "NO\n"}';

NOTE1: The command uses backticks around cat file.txt. (Backticks only show up on this site if they are escaped.)

NOTE2: And file.txt is from the above example:

stuff blah explore
morestuff explore here
foo

answered 10 Jan '11, 12:49 by joe 3 (edited 10 Jan '11, 12:55)

Answer 3

perl -e '$a=`cat file.txt`;if (grep(/foo/,$a)&&(grep(/blah/,$a))){print "YES\n"}else{print "NO\n"}';

answered 07 Jan '11, 20:15 by joe 2

Answer 4

If you use patterns instead of words, a simple chain of xargs grep commands will work:

grep -lZe 'pattern1' files... | xargs -0 grep -lZe 'pattern2' | xargs -0 grep -le 'pattern3'

The -lZ flags tell grep to output the name of each file with a match, followed by \0. xargs -0 reads the file names, separated by \0, and passes them as arguments to the specified command. Using \0 makes sure all file names, even those with spaces or newlines, are handled correctly. The final grep does not have the -Z flag, so its output is one file name per line.

The basic idea is to scan all files for the first pattern and output the names of the files that had a match. This list is then fed to the second grep, which outputs only the names of those files that also match the second pattern. This is repeated for each pattern. The last grep then outputs the names of the files that matched all patterns.

This is efficient because each grep only reads the files that survived the previous one. It works best if you put the pattern least likely to be found first, so the list shrinks as early as possible. In practice, ordering the patterns by length (longest first) works almost as well.

Note that you can prepend find to the chain by using the -print0 flag for find, i.e.

find . -name '*.txt' -print0 | xargs -0 grep -lZe 'pattern1' | xargs -0 grep -lZe 'pattern2' | xargs -0 grep -le 'pattern3'
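For comparison, the narrowing done by the grep chain above can be sketched in Python. This is an illustration under assumptions, not a replacement for the shell version: the names file_matches and files_matching_all are invented here, and the patterns are Python regular expressions, whose syntax differs in places from grep's.

```python
import re
from pathlib import Path


def file_matches(path, regex):
    """Return True if any line of the file matches the compiled regex."""
    with open(path, "r", errors="replace") as handle:
        return any(regex.search(line) for line in handle)


def files_matching_all(paths, patterns):
    """Filter paths one pattern at a time, like the chained grep -l calls."""
    survivors = [p for p in paths if Path(p).is_file()]
    for pattern in patterns:
        regex = re.compile(pattern)
        # Only files that matched every earlier pattern are read again.
        survivors = [p for p in survivors if file_matches(p, regex)]
        if not survivors:
            break  # nothing left to scan for the remaining patterns
    return survivors
```

Something like files_matching_all(Path(".").rglob("*.txt"), ["pattern1", "pattern2"]) would play the role of the find variant, since rglob walks the directory tree recursively.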

Answer 5


cat my_text_file.txt | tr -c a-zA-Z '\n' | sed '/^$/d' | sort -f | uniq -i -c

EDIT TO ANSWER QUESTION IN COMMENT/REPLY SECTION

We kick things off with the cat command and give it the name of the file we want to examine. The cat command then passes the contents of our text file to "tr". The tr command breaks up the file, putting each word on its own line, for easy access. (The -c flag complements the set of letters, so every character that is not a letter is replaced by the '\n' newline character.) We next filter the output through the sed command, which removes any empty lines. (The ^ immediately followed by the $ means we're looking for lines that have nothing between the beginning of the line and the end. The "d" at the end of the sed command indicates we want to delete any such lines.)

The list of words is then sorted so that identical words end up on adjacent lines, and passed to the "uniq" command, which performs the actual count for us; -c prints the count and -i ignores case. Should we want to narrow things down so we just see the count for the word "love", we can append the grep program to our command in this manner:


cat my_text_file.txt | tr -c a-zA-Z '\n' | sed '/^$/d' | sort -f | uniq -i -c | grep -i ' love$'
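The pipeline above boils down to "split on anything that is not a letter, then count each word ignoring case". A rough Python sketch of that logic follows; the function name word_counts is made up for illustration, and it assumes only ASCII letters count as word characters, as in the tr command.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words case-insensitively, treating non-letters as separators."""
    words = re.findall(r"[A-Za-z]+", text)  # same character set as tr a-zA-Z
    return Counter(word.lower() for word in words)
```

To answer the original question with it, read the file into a string, call word_counts on it, and check that the count for each keyword is greater than zero.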
