
Scraping a page with bash


A friend asked for my help scraping a website for some data, so I decided to write a small bash script to do the job.

You might ask: why bash? The idea was to do the least amount of work possible to get this done. As I use Linux, bash was ready for the job, with the help of a few other simple programs:

  • curl
  • grep
  • sed
  • printf
  • echo
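All of these come preinstalled on essentially every Linux distribution, but here is a quick sanity check you can run before the script (just a sketch):

```shell
# Check that each tool used by the script is available on PATH;
# command -v prints nothing and fails for a missing tool
for tool in curl grep sed printf echo; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```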

Another reason to use bash instead of Node.js or Python is that I want to improve my knowledge of bash. It looks to me like a pretty useful tool.

#!/bin/bash

# $# is the variable holding the number of arguments
# If fewer than one argument was given
if [ "$#" -lt 1 ]; then
  # Teach how to use the script
  echo "Use script with at least one ticker. ex.:"
  echo "$0 ticker1 [ticker2 [ticker3...]]"
  exit 1
fi

# Save all arguments into variable TICKERS
TICKERS="$@"

# The page URL to scrape; the ticker is appended at the end
# (placeholder value -- the real URL is omitted here)
URL="https://example.com/"

# Some headers for pretty printing our data
headers=( \
  "VAL P" \
  "DY CAGR 3a" \
  "AVG. 24m" \
)

# Print the headers
printf "%10s\t%10s\t%20s\t%20s\n" "TICKER" "${headers[0]}" "${headers[1]}" "${headers[2]}"

# For each ticker passed as an argument
for TICKER in $TICKERS; do
  # Get the page text, pipe to...
  # find each line that has the text 'class="value"', pipe to...
  # use a regular expression to keep only the content between ">" and "<" and
  # save each line as an item of the array data
  data=($(curl -s "$URL$TICKER" | grep 'class="value"' | sed -En 's/[^>]*>([^<]*).*$/\1/p'))

  # Print the data we wanted to scrape
  printf "%10s\t%10s\t%20s\t%20s\n" "${TICKER}" "${data[5]}" "${data[8]}" "${data[11]}"
done

exit 0
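To see what that curl | grep | sed pipeline actually extracts, here is the same grep and sed run against a fake HTML snippet (the markup below is made up; the real page will look different):

```shell
# Fake markup in the shape the script expects: values wrapped
# in elements with class="value"
html='<td class="value">12.34</td>
<td class="other">ignore me</td>
<td class="value">5.67%</td>'

# Keep only the lines containing class="value", then strip everything
# outside the first >...< pair, leaving just the cell content
printf '%s\n' "$html" \
  | grep 'class="value"' \
  | sed -En 's/[^>]*>([^<]*).*$/\1/p'
# Prints:
# 12.34
# 5.67%
```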

I bet there is a better way to pretty print this data; actually, there are probably better ways to do a lot of this. That grep | sed combo seems strange, buuuuut...
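For what it's worth, the grep step can be folded into sed itself: an address pattern restricts the substitution to matching lines, so one sed call does both jobs. A sketch with the same regex:

```shell
# /class="value"/ acts like the grep: only matching lines reach the
# substitution, and -n plus the trailing p print only substituted lines
printf '%s\n' '<td class="value">42</td>' '<td>skip</td>' \
  | sed -En '/class="value"/ s/[^>]*>([^<]*).*$/\1/p'
# Prints: 42
```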


Later on he asked if it was possible to "run" this inside a Google spreadsheet, and I wrote an equivalent of it in Google Apps Script, but that will be for a later post.

Hope you enjoyed, keep coding!

