Scraping a page with bash
11/10/2020
A friend asked for my help scraping a website for some data, so I decided to write a small bash script to do the job.
You might ask why bash. The idea was to finish this script with as little work as possible. Since I use Linux, bash was ready for the job, with help from a few other simple programs:
- curl
- grep
- sed
- printf
- echo
Another reason to use bash instead of node.js or python is that I want to improve my knowledge of bash. It looks like a pretty useful tool to me.
# $# holds the number of arguments passed to the script
# If fewer than one argument was given
if [ "$#" -lt 1 ]; then
# Explain how to use the script
echo "Use script with at least one ticker. ex.:"
echo "./miner.sh ticker1 [ticker2 [ticker3...]]"
exit 1
fi
# Save all arguments, separated by spaces, into the variable TICKERS
TICKERS="$*"
# The page url to scrape
URL=https://something.com/something/
# Some headers for pretty printing our data
headers=( \
"VAL P" \
"DY CAGR 3a" \
"AVG. 24m" \
)
# Print the headers
printf "%10s\t%10s\t%20s\t%20s" "TICKER" "${headers[0]}" "${headers[1]}" "${headers[2]}"
echo
# For each ticker passed as an argument
for TICKER in $TICKERS; do
# Get the page text, pipe to...
# grep, which keeps each line that has the text 'class="value"', pipe to...
# sed, which uses a regular expression to keep only the content between ">" and "<".
# Word splitting then saves each whitespace-separated value as an item of the array data
data=($(curl -s "$URL$TICKER" | grep 'class="value"' | sed -En 's/[^>]*>([^<]*).*$/\1/p'))
# Print the data we wanted to scrape (quoted so an empty value does not shift the columns)
printf "%10s\t%10s\t%20s\t%20s" "${TICKER}" "${data[5]}" "${data[8]}" "${data[11]}"
echo
done
exit 0
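To see what the grep | sed pipeline in the loop produces without touching the network, here is a minimal sketch that feeds it a made-up HTML fragment. The markup, the sample values, and the function name extract_values are assumptions for illustration only; the real page's structure is not shown in this post.

```shell
# Hypothetical helper: reads HTML on stdin and prints the text found
# between ">" and "<" on every line that contains class="value".
extract_values() {
  grep 'class="value"' | sed -En 's/[^>]*>([^<]*).*$/\1/p'
}

# Made-up sample markup standing in for the real page.
sample_html='<div class="title">Ticker</div>
<strong class="value">12.34</strong>
<span class="label">ignored</span>
<strong class="value">5.6%</strong>'

# Only the two class="value" lines survive, reduced to their text.
values=$(printf '%s\n' "$sample_html" | extract_values)
printf '%s\n' "$values"
```

In the script, the same output is wrapped in data=( $(...) ), so each whitespace-separated value becomes one array item, which is why the printf line can pick items by index.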
I bet there is a better way to pretty-print this data; in fact, there are probably better ways to do a lot of this. That grep | sed combination seems strange, buuuuut...
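For what it's worth, here is one possible tidy-up, offered as a sketch and not as the "right" way. It assumes two things: that sed can do the filtering itself through an address, which makes grep unnecessary, and that the newline can live in the printf format, which makes the extra echo unnecessary. The function names extract_values and print_row, the ticker ABCD3, and the values are all made up for the example.

```shell
# Hypothetical variant: the /class="value"/ address makes sed apply the
# substitution only to matching lines, doing the job grep did before.
extract_values() {
  sed -En '/class="value"/s/[^>]*>([^<]*).*$/\1/p'
}

# Hypothetical helper: prints one aligned row; the trailing \n in the
# format ends the line, so no separate echo is needed.
print_row() {
  printf '%10s\t%10s\t%20s\t%20s\n' "$@"
}

# Header row, then one row with made-up values.
print_row "TICKER" "VAL P" "DY CAGR 3a" "AVG. 24m"
print_row "ABCD3" "12.34" "5.6%" "0.78"
```

Passing each value as a quoted argument also keeps the columns in place when one of them is empty.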
THE SCRIPT DID THE JOB!
Later on he asked if it was possible to "run" this inside a Google spreadsheet, and I wrote an equivalent in gscript, but that will be for a later post.
Hope you enjoyed it. Keep coding!
marcel0ll