
Welcome!

Let's solve your technology problems!

My name is Marcelo L. Lotufo. I started programming when I was 15 years old so I could create my own maps in Warcraft 3 and my own games in Flash. I ended up falling in love with the craft, and now I am a software engineer at Lotuz, a small development agency.

Scraping a page with bash

11/10/2020

A friend asked for my help scraping a website for some data, so I decided to write a small bash script to do the job.

You might ask why bash. The idea was to finish this script with as little work as possible. Since I use Linux, bash was ready for the job, with help from a few simple programs:

  • curl
  • grep
  • sed
  • printf
  • echo

Another reason to use bash instead of Node.js or Python is that I want to improve my knowledge of bash. It looks like a pretty useful tool to me.

# $# holds the number of arguments
# If fewer than one argument was given
if [ "$#" -lt 1 ]; then
  # Explain how to use the script
  echo "Use the script with at least one ticker, e.g.:"
  echo "./miner.sh ticker1 [ticker2 [ticker3...]]"
  exit 1
fi

# Save all arguments into the array TICKERS
TICKERS=("$@")

# The page url to scrape
URL="https://something.com/something/"

# Some headers for pretty printing our data
headers=(
  "VAL P"
  "DY CAGR 3a"
  "AVG. 24m"
)

# Print the headers
printf "%10s\t%10s\t%20s\t%20s\n" "TICKER" "${headers[0]}" "${headers[1]}" "${headers[2]}"

# For each ticker passed as an argument
for TICKER in "${TICKERS[@]}"; do
  # Get the page text, pipe to...
  # find each line that has the text 'class="value"', pipe to...
  # use a regular expression to keep only the content between ">" and "<" and
  # save the results as items of the array $data (the shell splits them on whitespace)
  data=($(curl -s "$URL$TICKER" | grep 'class="value"' | sed -En 's/[^>]*>([^<]*).*$/\1/p'))

  # print the data we wanted to scrape; the quotes keep an empty value from shifting the columns
  printf "%10s\t%10s\t%20s\t%20s\n" "$TICKER" "${data[5]}" "${data[8]}" "${data[11]}"
done
exit 0
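
To see what the pipeline inside the loop does without hitting a real site, here is a small sketch that runs the same grep and sed on a made-up page fragment. The HTML, the function names sample_html and extract_values, and the values are all hypothetical; the real page's markup is not shown in this post.

```bash
#!/usr/bin/env bash

# Hypothetical page fragment (an assumption, not the real site's markup).
sample_html() {
  cat <<'EOF'
<div class="title">Valuation</div>
<strong class="value">12.34</strong>
<span class="label">ignored</span>
<strong class="value">5.6%</strong>
EOF
}

# Same filter as in the script: keep lines containing class="value",
# then print only the text between the first ">" and the next "<".
extract_values() {
  grep 'class="value"' | sed -En 's/[^>]*>([^<]*).*$/\1/p'
}

# Word splitting turns the filtered output into array items.
data=($(sample_html | extract_values))

echo "${data[0]}"   # 12.34
echo "${data[1]}"   # 5.6%
```

On the real page the interesting values happened to sit at positions 5, 8 and 11, which is why the script picks those indexes.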

I bet there is a better way to pretty-print this data. Actually, there are probably better ways to do a lot of this. That grep | sed seems strange, buuuuut...

THE SCRIPT DID THE JOB!
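
For the curious, here is one possible tidy-up, offered as a sketch rather than the "right" way. sed can select lines by itself with an address, so the grep step can be folded into it. The helper names extract_values and print_row and the sample input are hypothetical, added only for illustration.

```bash
#!/usr/bin/env bash

# One sed call doing both jobs: the /class="value"/ address selects the lines
# (what grep did), and the substitution prints the text between ">" and "<".
extract_values() {
  sed -En '/class="value"/ s/[^>]*>([^<]*).*$/\1/p'
}

# Print one aligned row; the "\n" in the format replaces the separate echo.
print_row() {
  printf '%10s\t%10s\t%20s\t%20s\n' "$@"
}

# Hypothetical input, only to show the two helpers working.
printf '%s\n' '<b class="value">1.5</b>' '<b class="other">x</b>' | extract_values
print_row "TICKER" "VAL P" "DY CAGR 3a" "AVG. 24m"
```

The first command prints only 1.5, because the second input line does not match the address and -n suppresses everything that is not explicitly printed.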

Later on, he asked if it was possible to "run" this inside a Google spreadsheet, so I wrote an equivalent in gscript, but that will be for a later post.

Hope you enjoyed it. Keep coding!

marcel0ll