tips

Check for fake googlebot scrapers

I noticed a bot scraping with a fake Googlebot user-agent string.

Here is a one-liner that can detect the IPs to ban:

$ awk 'tolower($0) ~ /googlebot/ {print $1}' /var/www/httpd/access_log | grep -v 66.249.71. | sort | uniq -c | sort -n

It does a case-insensitive awk search for the keyword "googlebot" in the Apache log file, filters out IPs starting with "66.249.71." (which belong to Google), and prints the remaining IPs sorted by hit count.

You can validate the IPs with:

IP=66.249.71.37 ; reverse=$(dig -x $IP +short | grep googlebot.com) ; ip=$(dig $reverse +short) ; [ "$IP" = "$ip" ] && echo $IP GOOD || echo $IP FAKE

Replace the IP value with the one you want to check.
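
To check many addresses at once, the detection and validation steps can be combined. Below is a minimal sketch that feeds every IP claiming to be Googlebot through the same reverse/forward DNS check; the log path and the googlebot.com match come from the commands above, while fake_googlebots.txt is just a hypothetical output file.

#!/bin/bash
# Verify every IP in the access log that claims to be Googlebot.
LOG=/var/www/httpd/access_log

awk 'tolower($0) ~ /googlebot/ {print $1}' "$LOG" | sort -u | while read -r IP; do
    reverse=$(dig -x "$IP" +short | grep googlebot.com | head -n1)
    forward=$(dig "$reverse" +short)
    if [ -n "$reverse" ] && [ "$IP" = "$forward" ]; then
        echo "$IP GOOD"
    else
        echo "$IP FAKE" | tee -a fake_googlebots.txt
    fi
done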

ssh keygen RSA versus DSA

When generating ssh keys, I usually use the RSA type, since it can generate a 2048-bit key while DSA is restricted to exactly 1024 bits.

ssh-keygen -t rsa -b 2048
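
To confirm the size of an existing key, ssh-keygen can print its length and fingerprint; the path below assumes the default RSA key location:

$ ssh-keygen -l -f ~/.ssh/id_rsa.pub

The first field of the output is the key length in bits.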

Week of Month

Here is a simple one-liner to get the week of the month with awk from `cal` output:

$ cal | awk -v date="`date +%d`" '{ for( i=1; i <= NF ; i++ ) if ($i==date) { print FNR-2} }'
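
The same approach works for dates other than today. The function below is a rough sketch wrapping the one-liner; the name wom is made up, and it assumes GNU date for the -d option:

# week of month for an arbitrary date (assumes GNU date)
wom() {
    cal $(date -d "$1" +"%m %Y") | \
        awk -v date="$(date -d "$1" +%d)" '{ for (i=1; i<=NF; i++) if ($i==date) print FNR-2 }'
}

wom 2012-07-20   # prints which week of July 2012 the 20th falls in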

highlight grep search string

Make grep highlight the search string by default by adding the alias below to your ~/.bashrc file:

alias grep='grep --color=auto'
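
After reloading ~/.bashrc, the highlight color itself can be changed through the GREP_COLORS environment variable on GNU grep; the bold yellow value below is just an example:

$ source ~/.bashrc
$ export GREP_COLORS='ms=01;33'
$ grep error /var/log/messages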

analog filesize limit

I had some trouble with analog monthly stats not showing up for the last week and figured out that analog refuses to parse huge log files. I had one sitting at 3 GB without being rotated, and analog would error out with:

/usr/bin/analog: Warning F: Failed to open logfile
  /var/log/httpd/access_log: ignoring it

After running gzip on the log file, analog was able to produce the reports again. I think I read somewhere that the limit may be 2 GB, but I have not tested this.
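
As a stopgap until the log is rotated properly, something like the sketch below can compress oversized logs before analog runs; it assumes GNU find and the log path from the error above, and Apache should be reloaded afterwards so it reopens its log file:

# compress access logs larger than 2 GB so analog will parse them (GNU find)
find /var/log/httpd -name 'access_log*' -size +2G -exec gzip {} \;
/usr/bin/analog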
