Monday, January 20, 2014

Finding Multiple Needles in a Haystack

Today I needed to find out if a JSON config file contained all the cases it needed to cover. I extracted the cases from a spreadsheet (a list of numbers, really), saved them in a text file, and ran the two files through the following script. It reported the four cases missing from the config file, which was much easier than comparing the two lists by hand.

#!/usr/bin/python
# Report every line of the first file (the needles) that does not
# appear anywhere in the second file (the haystack). The check is a
# plain substring match against the whole haystack.

import sys

if len(sys.argv) != 3:
    print('Usage: %s <needles> <haystack>' % sys.argv[0])
    sys.exit(1)  # non-zero status: wrong arguments is an error

# load the haystack
with open(sys.argv[2], 'r') as h:
    haystack = h.read()

# iterate through the needles, skipping blank lines and # comments
with open(sys.argv[1], 'r') as n:
    for line in n:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        if line not in haystack:
            print(line)
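
One thing to watch with a substring test like the one above: when the needles are numbers, a short needle such as 12 counts as found if the haystack merely contains 112. The following is a minimal sketch of a stricter variant, not what I actually ran; the function name uncovered and the sample data are made up for illustration. It only accepts a needle that is not glued to other word characters on either side.

```python
import re

def uncovered(needles, haystack):
    """Return the needles that do not occur in haystack as whole tokens."""
    missing = []
    for needle in needles:
        # The lookarounds reject matches that sit inside a longer word or number.
        pattern = r'(?<!\w)' + re.escape(needle) + r'(?!\w)'
        if not re.search(pattern, haystack):
            missing.append(needle)
    return missing

# Hypothetical data: 12 only appears inside 112, 30 appears on its own.
haystack = '{"cases": [112, 30]}'
print(uncovered(['12', '30', '7'], haystack))  # prints ['12', '7']
```

With the plain substring test, 12 would have been reported as covered here, because '12' in haystack is true.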


After I did this, I asked a colleague if there was a Unix one-liner to do the same thing. There isn't really, but you can use the comm command to get the difference between two sets: feed it the cases from the spreadsheet and the numbers from the JSON file (extracted with jq), both sorted and uniqued [1], and you get the same result. It might be doable in a line or two, but composing and debugging a complex command like that might take longer than writing that tiny Python program. YMMV.
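
For comparison, the set-difference idea behind comm can also be sketched in Python. This is only an illustration under assumptions: the helper names collect_numbers and missing_cases are hypothetical, and so is the sample config layout, since the real file's structure is not shown here. It gathers every integer in the parsed JSON and subtracts that set from the needles, roughly what comm -23 does with two sorted, uniqued files.

```python
import json

def collect_numbers(node):
    """Recursively gather every integer value found in parsed JSON."""
    if isinstance(node, bool):  # bool is a subclass of int, so exclude it first
        return set()
    if isinstance(node, int):
        return {node}
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        found = set()
        for child in node:
            found |= collect_numbers(child)
        return found
    return set()

def missing_cases(needles, config_text):
    """Return the needles absent from the config, sorted and de-duplicated."""
    return sorted(set(needles) - collect_numbers(json.loads(config_text)))

# Hypothetical config layout, for illustration only.
config = '{"cases": [{"id": 101}, {"id": 102}, {"id": 104}]}'
print(missing_cases([101, 102, 103, 104, 105], config))  # prints [103, 105]
```

Because this compares parsed values rather than raw text, it also avoids treating one number as found just because it appears inside a longer one.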

[1] Or just sort -u.
