I want to "grep" a file and get a match only if two or more of the keywords are found.

So, for example, I want to search a file for blah and foo, but they are on separate lines of the file. I only want to know whether the file contains both keywords. I realize grep may not be the solution for this and I may need to write a script of some sort, but grep or some equivalent would be much faster.

More Info:

I have a file that contains:

stuff blah explore
morestuff explore here
foo

I want to be able to search that file to see if both blah and foo appear in it.

asked 30 Dec '10, 17:12

Andy

edited 30 Dec '10, 19:32

Please accept an answer so the question/answer can be finished. Or provide more details so we can help.

(20 Apr '11, 14:21) rfelsburg ♦



I love Python ... I couldn't resist writing you some code. It would be easy to generalize it to check all the files in a directory ... but I'm trying to hook you!

I am sure it could be done in a more Pythonic way, but Python is great for string stuff -- way more readable than Perl. (Flame war alert!)

#!/usr/bin/python
""" USAGE: python two_tags.py data.txt target1 target2
    Scans for two arbitrary text strings
    $ python two_tags.py data.txt gosh golly
    Search 6 lines -- gosh NOT FOUND., golly FOUND!

    $ python two_tags.py data.txt golly howdy
    Search 2 lines -- golly FOUND!, howdy FOUND!
"""
#Sample file data.txt for testing code
#asdfs fsdf howdy  a ha b
#safd golly gee
#whiz
#zard
#spam
#golly

import sys

input_file = sys.argv[1]  # The filename you passed on the command line
target1 = sys.argv[2]
target2 = sys.argv[3]

line_count = 0
target1_found = 'NOT FOUND.'
target2_found = 'NOT FOUND.'
with open(input_file, 'r') as f:
    for line in f:
        line_count += 1
        # The "in" operator is a plain substring test (it replaces string.find).
        if target1 in line:
            target1_found = 'FOUND!'
        if target2 in line:
            target2_found = 'FOUND!'
        # Stop reading as soon as both targets have been seen.
        if target1_found == 'FOUND!' and target2_found == 'FOUND!':
            break

print("Search %i lines -- %s %s, %s %s\n" % (line_count, target1, target1_found, target2, target2_found))
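
The generalization to a whole directory mentioned above could look roughly like this. It is only a sketch: the function names `file_contains_all` and `files_containing_all` are made up for illustration, it accepts any number of target strings instead of exactly two, and it assumes the files are text that Python can decode (undecodable bytes are replaced rather than raising an error).

```python
import os


def file_contains_all(path, targets):
    """Return True if every string in targets occurs somewhere in the file."""
    missing = set(targets)
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            # Keep only the targets that have still not been seen.
            missing = {t for t in missing if t not in line}
            if not missing:
                return True  # everything found, stop reading early
    return not missing


def files_containing_all(directory, targets):
    """Return the sorted names of regular files in directory containing every target."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and file_contains_all(path, targets):
            matches.append(name)
    return matches
```

Calling `files_containing_all(".", ["blah", "foo"])` would then list the files in the current directory that contain both keywords, on any lines. Like the script above, this is a substring test, so "foo" also matches "food".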

answered 29 Jan '11, 07:03

pcardout

edited 29 Jan '11, 07:09

perl -e '$a=`cat file.txt`;if (grep(/foo/,$a)&&(grep(/blah/,$a))){print "YES\n"}else{print "NO\n"}';

NOTE1: Backticks did not show up on this site at first. The command should read BACKTICK cat file.txt BACKTICK (correction: backticks do show up if escaped).

NOTE2: file.txt is the file from the example above:

stuff blah explore
morestuff explore here
foo

answered 10 Jan '11, 12:49

joe 3

edited 10 Jan '11, 12:55

perl -e '$a=`cat file.txt`;if (grep(/foo/,$a)&&(grep(/blah/,$a))){print "YES\n"}else{print "NO\n"}';


answered 07 Jan '11, 20:15

joe 2

If you use patterns instead of words, a simple chain of xargs grep commands will work:

grep -lZe 'pattern1' files... | xargs -0 grep -lZe 'pattern2' | xargs -0 grep -le 'pattern3'

The -lZ flags tell grep to output the name of each file with a match, followed by \0. xargs -0 reads file names separated by \0 and passes them as arguments to the specified command. Using \0 makes sure all file names, even those with spaces or newlines, are handled correctly. The final grep does not have the -Z flag, so its output will be one file name per line.

The basic idea is to first scan all files for the first pattern and output the names of the files that matched. This list is then fed to the second grep, which only outputs the names of those files that also match the second pattern. This is repeated for each pattern. The last grep then outputs the names of the files that matched all patterns.

This is efficient because each grep only reads the files that matched every earlier pattern, especially if you put the patterns least likely to be found in the files first. In practice, ordering the patterns by length (longest first) works almost as well.

Note that you can prepend find to the chain by using the -print0 flag for find, i.e.

find . -name '*.txt' -print0 | xargs -0 grep -lZe 'pattern1' | xargs -0 grep -lZe 'pattern2' | xargs -0 grep -le 'pattern3'
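
For comparison, the same narrowing idea can be sketched in Python. This is an illustration rather than a replacement for the grep chain: `find_txt`, `matches`, and `filter_chain` are hypothetical names, and the patterns are interpreted by Python's `re` module, whose syntax differs in places from grep's basic regular expressions.

```python
import re
from pathlib import Path


def find_txt(root):
    """Return all regular files below root whose name ends in .txt."""
    return [p for p in Path(root).rglob("*.txt") if p.is_file()]


def matches(path, pattern):
    """Return True if any line of the file matches the regular expression."""
    regex = re.compile(pattern)
    with open(path, "r", errors="replace") as handle:
        return any(regex.search(line) for line in handle)


def filter_chain(paths, patterns):
    """Narrow the list of paths one pattern at a time, like the grep chain."""
    remaining = list(paths)
    for pattern in patterns:
        # Only files that survived the earlier patterns are read again.
        remaining = [p for p in remaining if matches(p, pattern)]
    return remaining
```

A call such as `filter_chain(find_txt("."), ["pattern1", "pattern2", "pattern3"])` returns the paths that matched every pattern. As with the shell version, putting the rarest pattern first means fewer files are read in later passes.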

answered 31 Dec '10, 17:55

Nominal Animal


cat my_text_file.txt | tr -c a-zA-Z '\n' | sed '/^$/d' | sort -f | uniq -i -c

EDIT TO ANSWER QUESTION IN COMMENT/REPLY SECTION

We kick things off with the cat command and give it the name of the file we want to examine. The cat command then passes the contents of our text file to "tr". The tr command breaks up the file, putting each word on its own line, for easy access. (The -c option makes tr act on every character that is not in the given set of letters, and the '\n' is what each such character is replaced with.) We next filter the result through the sed command, which removes any empty lines. (The ^ immediately followed by the $ means we're looking for lines that have nothing between the beginning of the line and the end. The "d" at the end of the sed command indicates we want to delete any such lines.) The list of words is then sorted alphabetically and passed to the "uniq" command, which performs the actual count for us: -c prefixes each word with the number of times it occurs and -i ignores case. Should we want to narrow things down so we just see the count for the word "love", we can append the grep program to our command in this manner:


cat my_text_file.txt | tr -c a-zA-Z '\n' | sed '/^$/d' | sort -f | uniq -i -c | grep -i ' love$'
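
The same word count can be mirrored in Python, which may be easier to adapt. This is a minimal sketch: `word_counts` is a name chosen for illustration, and it assumes, like the pipeline, that a word is a run of ASCII letters and that case should be ignored.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words case-insensitively; a word is a run of ASCII letters."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(word.lower() for word in words)
```

With the file contents read into `text`, `word_counts(text)["love"]` gives the count for a single word, and `all(word_counts(text)[k] > 0 for k in ("blah", "foo"))` checks whether both keywords from the question are present. Note that this matches whole words only, so "foo" would not match "food".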


answered 30 Dec '10, 17:40

Ron ♦

edited 30 Dec '10, 20:17

Ron, thanks for the tip! I must admit I'm not a great sed user yet but am learning. Where in the line would I add blah and foo? Or would it be in the tr section? Thanks!

(30 Dec '10, 18:35) Andy

I'll post another answer since I need more room than is allotted in this reply space. Look there.

(30 Dec '10, 20:16) Ron ♦
Asked: 30 Dec '10, 17:12

Seen: 2,766 times

Last updated: 20 Apr '11, 14:21