The Who

I’m your friendly neighborhood Unix man. And I have a problem. I love regex. Now I have two problems.

The Why

Recently I have become destitute, so I have taken to using the skills I have to provide for my poor sick family. It’s mostly disabled cats. And when all you have is a hammer, everything gets nailed. So I thought, you know what people probably don’t automate enough of? Data entry. I logged onto a freelance site of my choice, Upwork, and found exactly what I wanted: someone who wanted to turn a PDF into a CSV. Specifically, a parts catalog into a Shopify CSV.

The Good

The PDF was easy enough to convert into text. Originally I thought I would convert all the pages into TIFFs and use tesseract to OCR the characters. As it turned out, this PDF was all selectable text, and just using pdftotext was sufficient.

For Mac:

brew install poppler     # Homebrew's poppler formula provides pdftotext
brew install tesseract   # only needed if you go the OCR route
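For reference, here is a minimal sketch of the conversion step. The script further down reads c.txt, so that is the output name used here; catalog.pdf is a placeholder, and -layout (which tries to keep the page's column layout) is an assumption about what helps with a parts catalog, not necessarily the option used for this job.

```shell
# Convert a PDF with selectable text into plain text.
# $1 = input PDF, $2 = output text file.
pdf_to_text() {
    pdftotext -layout "$1" "$2"
}

# Usage (hypothetical file name):
#   pdf_to_text catalog.pdf c.txt
```

If the output comes back empty, the PDF is probably scanned images, and the TIFF plus tesseract route is the fallback.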

All in all, when this hideous brainchild was finished, I had about an 87 percent success rate.

The Bad

There are things I regret, and things I regret. I am ashamed to admit this, but one of the earlier things I did was strip all the quotes out of the resulting pdftotext file, mostly so I could clean up the commas using this function:

function rqc () { awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' "$@" | sed 's/"//g' ;}

This function basically removes all the commas between quotes in a CSV, then removes the quotes. Yeah, for sure, there were better ways to deal with that. But it’s not like I have a peer group of other regexers at my beck and call.
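To see what that does in practice, here is the same function (written with "$@" quoted and in portable function syntax) run on a single made-up line. The sample row is invented for illustration and is not from the real catalog.

```shell
# Same logic as rqc above: splitting on double quotes makes every even
# field the inside of a quoted string; commas are removed there, then
# any quotes that are still left are stripped.
rqc() {
    awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' "$@" | sed 's/"//g'
}

# Hypothetical input line; the comma inside "Widget, Large" disappears.
printf '%s\n' '"a-b","Widget, Large","12.50"' | rqc
# prints: a-b,Widget Large,12.50
```

The commas that separate fields survive because they sit outside the quotes, in the odd-numbered awk fields.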

The Ugly

Let’s look at the code:

set -o nounset                              # Treat unset variables as an error
BROAD=$(grep '[A-Z]\.' c.txt -A2 -B9)
#List of all Skus
#LIST=$(grep -P -o '\d{5}-\d{2}\w{0,1}' c.txt | sort | uniq )
LIST=$( grep -o '^[0-9]\{7,8\}' c.txt | sort | uniq  )
#Temp for checking if adjacent price
TEMP=0
for n in $LIST
do
    #Get Name
    NAME=$( grep "$n" c.txt  -B9 | grep '^[A-Z]\.' | sed 's/^[A-Z]\. //' | sed 's/^[A-Z ]\+       \([A-Z][a-z]\)/\1/' | sed 's/"//g' | grep '[A-Z ]\+' | tail -n1 | sed 's/[a-z].*//' | sed 's/.$//' )
    #Get Body
    DESCRIPTION=$( grep "$n" c.txt  -B9 | grep '^[A-Z]\.' | sed 's/^[A-Z]\. //' | sed 's/^[A-Z ]\+\([A-Z][a-z]\)/\1/' | sed 's/"//g')

    if [[  -z $NAME ]]
    then
        NAME=$( grep "$n" c.txt  -B10 | grep '^[A-Z]\+' | head -n1 | sed 's/^[A-Z]\. //' | sed 's/[A-Z][a-z].*//g')
    fi
    #Get Handle
    HANDLE=$( echo $NAME | tr '[A-Z]' '[a-z]' | sed 's/ /-/g' | sed "s/$/-${n}/"| sed 's/[A-Z][a-z]//' )

    #Get Price
    # Double quotes so ${n} expands; the dollar sign in the price is escaped
    PRICE=$(grep "$n" c.txt -A6 | grep -P -o "${n} \\\$\d+\.\d+" | grep -P '\d+\.\d+' -o)
    if [[  -z $PRICE ]]
    then
        PRICE=$(grep "$n" c.txt -A6 | grep -P -o '\$\d+\.\d+' | sed 's/\$//g' |tr ' ' '\n' | head -n1)
    fi
    # Do Var Price Math
    FIFTEEN=$( echo "scale=2; (${PRICE}) * .15" | bc )
    VARPRICE=$( echo "scale=2; ${PRICE} - $FIFTEEN"| bc )
    # Add Sku to Name
    NAME=$(echo $NAME | sed "s/$/ - ${n}/" )
    if [[ ! -z $n ]]
    then
        FINAL=$(echo "\"$HANDLE\", \"$NAME\" , \"$DESCRIPTION\", \"Vendor\", \"\" , , \"\", \"Title\", \"Default Title\", , , , , \"$n\", ,,,,,\"$VARPRICE\",\"$PRICE\"")
    fi
    if [[ ! -z $FINAL ]]
    then
        echo $FINAL
    fi
done
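The FINAL line above wraps each field in quotes by hand, which only works because the quotes were stripped out of the text beforehand. One of the "better ways" could be to escape instead of strip: the usual CSV convention is to double any quote inside a quoted field. A minimal sketch, where csv_field and csv_row are hypothetical helper names and not part of the original script:

```shell
# Quote one value for CSV: double any embedded quotes, then wrap in quotes.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Join all arguments into one CSV row, quoting every field.
csv_row() {
    local out="" f
    for f in "$@"; do
        out="${out:+$out,}$(csv_field "$f")"
    done
    printf '%s\n' "$out"
}

# Hypothetical values; the embedded quote and the comma both survive.
csv_row 'bolt-1234567' 'Bolt 3" long, zinc' ''
# prints: "bolt-1234567","Bolt 3"" long, zinc",""
```

With something like this, commas and inch marks in descriptions can stay in the data instead of being deleted up front.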

A note about BSD tools vs GNU tools

I started doing this on my MacBook Air. Mac OS X uses the BSD cousins of grep and sed, and more and more, for various reasons, I found myself using ggrep and gsed rather than grep and sed, mostly because of how BSD sed deals with newlines, and because on occasion I wanted to use PCRE regex with grep, i.e. the -P flag. Eventually I fully switched to a Linux environment, using the GNU toolset. The final straw actually had nothing to do with the tools: the resolution and screen space on the MacBook Air didn’t allow me to see the lines the way I wanted them in a vimdiff. So I switched to a Debian system in the final rounds of the script.
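If you want one script to run on both systems, a small sketch like this picks the GNU tools when they are installed under their Homebrew names (ggrep, gsed) and falls back to the plain ones otherwise. pick_tool is a hypothetical helper, not something the original script used.

```shell
# Print the first command name from the arguments that exists on this system.
pick_tool() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

GREP=$(pick_tool ggrep grep)
SED=$(pick_tool gsed sed)

# Usage: "$GREP" -P -o '\d+\.\d+' c.txt
```

Note that the fallback on a Mac without the GNU tools is still BSD grep, which does not support -P, so this only helps if ggrep is actually installed.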

Hey Guys Don’t Forget about my poor cats

My cats, they are blind, they are missing limbs. Please purchase a book about regex or a book about successful coders.