introduction to python list comprehensions

When I first saw a list comprehension I simply shook my head in utter confusion. The fact is, they are actually fairly straightforward. Sure, you can make them more and more complicated, but here are some basic examples to get you started…

Take a look at the following: the first ‘i’, before the start of the for loop, refers to the variable ‘i’ within the for loop.
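Something along these lines (a reconstructed minimal example; the range of 10 is just for illustration):

print([i for i in range(10)])
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]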

To put the code into plain English: print i for each i in the range. Take a look at the code and it should make sense.

Let’s look at this in a slightly different format…
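For example (a reconstructed sketch), the same comprehension assigned to a name first and printed afterwards:

numbers = [i for i in range(10)]
print(numbers)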

Here is more of a ‘real world’ example, something I actually used just the other day. I was doing a little web scraping and the data was returning empty strings into the list, so I used a list comprehension to make a new list with the empty strings removed.
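Something like this did the trick (the scraped data below is made up to stand in for the real list):

scraped = ['Acme Ltd', '', 'Widget Co', '', 'Example Inc']
cleaned = [item for item in scraped if item != '']
print(cleaned)   # ['Acme Ltd', 'Widget Co', 'Example Inc']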

Getting the same output without using a list comprehension might look like this…
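(Same made-up data as above, written out as a plain for loop.)

scraped = ['Acme Ltd', '', 'Widget Co', '', 'Example Inc']
cleaned = []
for item in scraped:
    if item != '':
        cleaned.append(item)
print(cleaned)   # ['Acme Ltd', 'Widget Co', 'Example Inc']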

So as you can see they are a little tidier and not that difficult to get started with. There’s a nice tutorial with further examples found here.

web scraping out in the wild

So I’ve done a little web scraping, mainly for myself and a little for work purposes. A few nights ago my wife was telling me about work and a task she had to do for one of her colleagues: fetching data from a website full of contractors’ information from within her industry. Intrigued, I took a look and thought I could probably automate this :)

I’m not going to reveal the website but here’s what went down.

Having a look around the site, I realised all the links were in JavaScript. This was the first hurdle, but luckily it wasn’t actually going to be a problem. I opened up a contractor page and the URL structure was like so…

http://www.domain.com/contractor-search/3547/contractor-name

I started playing with the URL to see how it worked and quickly realised that if I removed the contractor name the page would still resolve. I now had the same page, but the URL looked like this…

http://www.domain.com/contractor-search/3547

Next I changed the number from 3547 (or whatever it was) to just 1. I pressed enter and nothing; the page resolved but it was a blank template. I tried 2 and bingo, there was the first contractor’s information. A few more tests and it was clear this number was an ID assigned in the order each contractor was added to the database.

This meant I could bypass the JavaScript issue entirely and spider each page directly.

The next step was even easier: the contractor’s name was in an h3 heading tag (the only h3 tag on each page) and the contractor’s email was marked up with a ‘mailto:’ link. Those were the only two bits I needed.

Within 30 minutes or so I had this running in the cloud doing all the hard work…
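(This is a stripped-down sketch rather than the exact script; it assumes the requests and BeautifulSoup libraries, and the domain is obviously a stand-in.)

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.domain.com/contractor-search/'

for page_id in range(1, 5001):
    response = requests.get(base_url + str(page_id))
    if response.status_code != 200:
        continue                                  # skip anything that doesn't resolve
    soup = BeautifulSoup(response.text, 'html.parser')
    heading = soup.find('h3')                     # the only h3 on the page holds the name
    email_link = soup.find('a', href=lambda h: h and h.startswith('mailto:'))
    if heading and email_link:
        name = heading.get_text(strip=True)
        email = email_link['href'].replace('mailto:', '')
        print(name, email)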

As you can see, I’m using range to iterate 5,000 times and I’m also using the counter as the URL number, so the URL changes like this through each iteration…

http://www.domain.com/contractor-search/1

http://www.domain.com/contractor-search/2

http://www.domain.com/contractor-search/3

and so on…
http://www.domain.com/contractor-search/5000

python script to monitor site uptime

I wrote the following script to monitor my clients’ sites’ uptime: if a site goes down for whatever reason, I will be notified via email. This doesn’t include sites hosted by ourselves (Bronco), as they are monitored already; it’s for sites where we only do consulting and the hosting is handled by others.

The reason I decided to make this was that I happened to be reviewing a client’s site when it went down (I wasn’t doing anything but viewing the source from the browser!). I notified the client and their development team, and neither was aware the site had gone down, so I potentially saved some losses as it was quickly put back online.

The script itself, although it looks simple enough, was admittedly a little tricky; it uses multithreading in order to keep both loops running simultaneously.

The first function, email_sender(), is what it sounds like: it sends emails. This is powered by Gmail, so you need to add the email address and password of the account you wish to send the notifications from. You will likely want to authorise the server you are running the script from: start by trying to send an email from it; if it fails, go to this link and authorise it, then try sending again within 10 minutes and Google will whitelist it (you’ll need to be signed in).

The next function, site_up(), runs through the sites you list and checks each one, looking for a 200 status response code. If it receives anything else, it passes the site on to a temporary dictionary that the second function is watching and deletes it from the main dictionary. After the site has sat in the temporary dictionary for 15 minutes it is checked again; if it is still returning anything other than a 200 response, an email is fired to the corresponding email address alerting you that there is an issue (it’s set up like this so you can include colleagues with different email addresses). Every 15 minutes it checks whether the site is back up or not, sending an email each time. Once the site is back up it fires another email saying the site is once again live, deletes it from the temporary dictionary and adds the site back into the main pool.

The site_up() function will continue monitoring all the other sites even when a site goes down and into the site_down() monitoring state.
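The full script isn’t reproduced here, but below is a minimal sketch of the structure described above. The Gmail address, password, site list and check intervals are all placeholders, and I’m assuming the requests library for the status checks.

import smtplib
import time
import threading
import requests
from email.mime.text import MIMEText

GMAIL_USER = 'you@gmail.com'          # account the alerts are sent from
GMAIL_PASSWORD = 'your-password'

# site -> email address to notify
sites = {
    'http://www.example.com': 'you@yourcompany.com',
    'http://www.example.org': 'colleague@yourcompany.com',
}
down_sites = {}                        # temporary dictionary for sites currently failing

def email_sender(recipient, subject, body):
    # send a notification through Gmail's SMTP server
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = GMAIL_USER
    msg['To'] = recipient
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.login(GMAIL_USER, GMAIL_PASSWORD)
    server.sendmail(GMAIL_USER, [recipient], msg.as_string())
    server.quit()

def is_up(site):
    # True only if the site returns a 200 response
    try:
        return requests.get(site, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def site_up():
    # sweep the main pool; anything not returning a 200 moves to down_sites
    while True:
        for site, email in list(sites.items()):
            if not is_up(site):
                down_sites[site] = email
                del sites[site]
        time.sleep(300)                # a nice slow pace between sweeps

def site_down():
    # re-check failing sites every 15 minutes until they come back
    while True:
        for site, email in list(down_sites.items()):
            if is_up(site):
                email_sender(email, 'Site back up', site + ' is live again')
                sites[site] = email
                del down_sites[site]
            else:
                email_sender(email, 'Site down', site + ' is still not returning a 200')
        time.sleep(900)

threading.Thread(target=site_up, daemon=True).start()
threading.Thread(target=site_down, daemon=True).start()

while True:
    time.sleep(60)                     # keep the main thread alive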

Just an FYI: this checks each site at a nice slow pace, so there is no concern about flooding the server, slowing it down or bringing it down. It also won’t register as site traffic within analytics, so there are no issues there. It will be visible within the server logs, but just inform the client in the unlikely event that they question it.


mining all tweets with python

According to the Twitter API, you can extract a maximum of 3,200 tweets, 200 at a time. To do this I’m using Twython, a Python wrapper for the API.

After reading the docs and doing a little searching I couldn’t find anything on extracting the full amount. I figured it was just some magic happening on the API side of things, and that with multiple calls it would just know to give you the next set of tweets, so I came up with something like this…
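(A reconstruction of that first attempt; the credentials and account name are placeholders.)

from twython import Twython

twitter = Twython('APP_KEY', 'APP_SECRET', 'OAUTH_TOKEN', 'OAUTH_TOKEN_SECRET')

all_tweets = []
for _ in range(16):                    # 16 * 200 = 3,200, the stated maximum
    tweets = twitter.get_user_timeline(screen_name='some_account', count=200)
    all_tweets.extend(tweets)

print(len(all_tweets))                 # 3,200 entries, but only the latest 200 unique tweets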

So basically this just makes the function call 16 times (200 * 16 = 3,200, the maximum number of tweets). All this did, though, was extract the latest 200 tweets 16 times.

After more reading I came across this in the API docs. It explains the use of the max_id and since_id parameters: essentially, you can specify which ID to extract from. The following from the docs should make it a little clearer…

To use max_id correctly, an application’s first request to a timeline endpoint should only specify a count. When processing this and subsequent responses, keep track of the lowest ID received. This ID should be passed as the value of the max_id parameter for the next request, which will only return Tweets with IDs lower than or equal to the value of the max_id parameter. Note that since the max_id parameter is inclusive, the Tweet with the matching ID will actually be returned again, as shown in the following image:

[Image: diagram from the Twitter API docs showing how max_id pagination works]

With this info and a little playing around I came up with the following. Basically, all you do is find out the tweet ID of the last tweet the account made; for this you would do the following…
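(Using the same authenticated twitter object and placeholder account name as above.)

last_tweet = twitter.get_user_timeline(screen_name='some_account', count=1)
print(last_tweet[0]['id'])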

This ID will be something like 467020906049835008. Now we introduce the max_id parameter: we add this ID to a list and, as we extract the tweets, we append each tweet’s ID to the list and specify max_id as the last item in the list (the last tweet ID from each extraction).

Here is the final code along with import and authentication requirements.
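(Again a sketch with placeholder keys and account name; the original may differ slightly, but the max_id loop is the important part.)

from twython import Twython

twitter = Twython('APP_KEY', 'APP_SECRET', 'OAUTH_TOKEN', 'OAUTH_TOKEN_SECRET')
screen_name = 'some_account'

all_tweets = []
ids = []

# the first request only specifies a count, as the docs suggest
batch = twitter.get_user_timeline(screen_name=screen_name, count=200)
all_tweets.extend(batch)
ids.extend(tweet['id'] for tweet in batch)

for _ in range(15):                               # 15 more batches of 200 = 3,200 max
    batch = twitter.get_user_timeline(screen_name=screen_name,
                                      count=200,
                                      max_id=ids[-1] - 1)   # -1 avoids repeating the inclusive ID
    if not batch:
        break
    all_tweets.extend(batch)
    ids.extend(tweet['id'] for tweet in batch)

for tweet in all_tweets:
    print(tweet['text'])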

Let me know if you found this useful or if you know a more efficient way to write it.

parse an XML sitemap with python

Had to do this a lot this week so thought I’d make it easy.
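Here’s the sort of thing I mean: a short sketch using requests and the standard library’s ElementTree (the sitemap URL is a placeholder).

import requests
import xml.etree.ElementTree as ET

sitemap_url = 'http://www.example.com/sitemap.xml'   # swap in the sitemap you need

root = ET.fromstring(requests.get(sitemap_url).content)

# every element in a standard sitemap sits under this namespace
namespace = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

for loc in root.iter(namespace + 'loc'):
    print(loc.text)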

working with csv files

When working with large datasets I tend to use Python, as it’s a lot faster than Excel for file manipulation and doesn’t crash on large inputs.

One thing I struggled with in the past was column selection. I have spoken to lots of different people and done plenty of reading on the subject, but I think I have found the most elegant solution, and even better, it uses the standard library – not Pandas, which everyone seems to suggest.

Here it is…
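(A minimal sketch of the idea; the file name and column positions are placeholders for your own data.)

import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row[0], row[2])     # print just the first and third columns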

Beautiful, isn’t it?

Now let’s say you want to label each column as you print it…
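(The labels and columns here are made up; adjust to whatever your file holds.)

import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print('URL:', row[0], '| Anchor text:', row[1])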

Or maybe you want to search for something within a column to check it exists?
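(Again a sketch; swap in your own file, column and search term.)

import csv

search_term = 'example.com'

with open('data.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if search_term in row[0]:
            print(row)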

Here is a more job-specific example: say you wanted to count the anchor text frequency of a backlink profile…
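(This assumes the anchor text sits in the second column of the export, with a header row.)

import csv
from collections import Counter

anchors = Counter()

with open('backlinks.csv') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    for row in reader:
        anchors[row[1]] += 1          # tally the anchor text column

for anchor, count in anchors.most_common():
    print(count, anchor)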

wordpress titles to lowercase with mysql

I randomly decided I didn’t want uppercase titles on my blog posts anymore, so I opened up ‘All Posts’ within my WordPress back-end to start running through them. Now, there aren’t that many posts on my blog (30+), but it was still going to take a while to go through and update them all, so I thought: why not have a crack at coding it instead :)


I decided MySQL would be my best bet, plus I have been playing with SQLite3 recently for some minor Python projects, so I thought it would be good to practice some SQL commands.


The first hurdle is figuring out which database is used by your WP site (tricky if you have lots of sites running). To do this, simply go to the root folder of your WordPress install (on your server – I like FileZilla) and locate the file called wp-config.php. Open this file and you should see something like this…


<?php
/**
* The base configurations of the WordPress.
*
* This file has the following configurations: MySQL settings, Table Prefix,
* Secret Keys, WordPress Language, and ABSPATH. You can find more information
* by visiting {@link http://codex.wordpress.org/Editing_wp-config.php Editing
* wp-config.php} Codex page. You can get the MySQL settings from your web host.
*
* This file is used by the wp-config.php creation script during the
* installation. You don't have to use the web site, you can just copy this file
* to "wp-config.php" and fill in the values.
*
* @package WordPress
*/

// ** MySQL settings - You can get this info from your web host ** //
/** The name of the database for WordPress */
define('DB_NAME', 'craigadd_wrdp15');


As you can see, the database name is defined on the DB_NAME line above (craigadd_wrdp15 in my case) – this is what you are looking for.


Next you want to access this database within your hosting cPanel, most likely under phpMyAdmin.


Once you’ve found it, look for the table ‘wp_posts’, then within the SQL query section add the following code, take a deep breath and hit ‘Go’!


UPDATE wp_posts SET post_title = LOWER(post_title)


If all went to plan you should now have all lowercase post titles – like I do :)

if machines can do it, they should

Came across the following quote and it pretty much sums up my own thoughts so thought I’d publish it. Read the full article here.

 

“If your school or job forces you into repetition, and quitting isn’t an option, learn Python (or a similarly useful language). With that knowledge, script your way past purposeless mandates and spend your valuable time thinking about whatever fascinates you most.”

scraping twitter and facebook shares with python

There are obviously dozens of reasons to want to see Twitter and Facebook shares, so I have written a surprisingly simple script to do the job for me – I thought this would run to 50+ lines of code at least. Obviously it only checks a single URL, but with a slight modification I could feed in multiple URLs: competitors, a news or blog section, or an entire site.

It uses the share APIs and parses the JSON data to display what we’re after. You can go here to see the raw data that is fetched – http://cdn.api.twitter.com/1/urls/count.json?url=http://www.craigaddyman.com/an-interview-with-rand-fishkin-of-seomoz/

As you can see, the following is displayed:
{"count":31,"url":"http:\/\/www.craigaddyman.com\/an-interview-with-rand-fishkin-of-seomoz\/"}

It’s the "count":31 section we are after. Anyway, the code and output are below!
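(A sketch of the sort of script this was, using requests. The Twitter count endpoint above has since been retired, and the Facebook call via graph.facebook.com is my assumption of how the share count was fetched.)

import requests

url = 'http://www.craigaddyman.com/an-interview-with-rand-fishkin-of-seomoz/'

# Twitter share count from the (old) count.json endpoint
twitter_data = requests.get('http://cdn.api.twitter.com/1/urls/count.json',
                            params={'url': url}).json()
print('Twitter shares:', twitter_data['count'])

# Facebook share count from the (old) Graph API lookup
facebook_data = requests.get('http://graph.facebook.com/', params={'id': url}).json()
print('Facebook shares:', facebook_data.get('shares', 0))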

Output…

mass link duplication checker with python

I wrote this script as I was getting pissed off with Excel crashing when checking a measly 100,000 links; working with formulas, specifically VLOOKUP, can be a real nightmare!

It seems to be working fine; I haven’t encountered any bugs, but I would like to speed it up further if I can. I have loaded close to a million URLs into it and it runs fine, but that takes about an hour and a half when running it straight off my desktop.

If anyone would like to give me suggestions on how to get it running faster, that would be great and, of course, credit will be given where credit is due! :)

Here’s the code! (If anyone else wants to use this, feel free.)
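(A cut-down sketch of the idea rather than the full script: it assumes plain text files with one URL per line, an existing list to check new links against, and a crude progress readout standing in for the fancy progress bar mentioned below.)

# load the links we already have into a set for fast lookups
with open('existing_links.txt') as f:
    existing = set(line.strip() for line in f)

duplicates = []
unique = []

with open('new_links.txt') as f:
    for line_number, line in enumerate(f, 1):
        url = line.strip()
        if url in existing:
            duplicates.append(url)
        else:
            unique.append(url)
        print('checked', line_number, 'urls', end='\r')   # crude progress readout

print()
print(len(duplicates), 'duplicates found')
print(len(unique), 'unique urls')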

Update

So I asked for some help on making this program run quicker. The code above can check around a million URLs in an hour and a half, but it turns out it is actually fast once you strip out the fancy terminal progress bar I added on lines 55 – 58; simply removing that means it now checks a million URLs in around 7.5 seconds!

The new code then would look like this…


You can see the discussion of this here.

Marc Poulin also offered another rewrite of the script to make it even quicker, seen here and below…


Thanks to everyone else who gave me suggestions too :)

checking http response codes in python

I’ve just been trying to check for canonical issues on a site’s domain, and my usual tool of choice was showing response codes that I thought were incorrect; it was almost as if a meta refresh was happening.

Anyway, I wrote a quick script to double-check it for me. Here it is…
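(A sketch along those lines using the requests library; the redirect handling is how I’d check it rather than necessarily how the original did.)

import requests

url = input('Enter the URL to check: ')

if not url.startswith('http'):
    url = 'http://' + url

# raw response for the exact URL entered, without following redirects
response = requests.get(url, allow_redirects=False)
print(url, '->', response.status_code)

# then follow any redirects to see where we actually end up
final = requests.get(url)
print('Final URL:', final.url, '->', final.status_code)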


I added the user input section simply for ease of use, in case I ever use it again.

I can make a slight modification to iterate over a file of URLs if I should ever need to. Anyway, just a little problem that I solved within about 3 minutes! :)

2013 recap and 2014 goals

Wow what a year, most notable was of course getting married, after 7 years we finally did it. A small, intimate family gathering at Gretna Green; it was a fantastic day with smiles all round.

I started working at Bronco for Dave & Becky Naylor, so while you lot pay to see Dave speak, I have him sat opposite me :)

For the first time in my life I have also started saving, I have a pot going for a house, one for a pension and one for personal – watching it add up really is satisfying!

The last big thing, I guess, is learning Python. Looking back at when I first started and some of the scripts I was writing then, compared to some of the cool things I have made now, is amazing (to me anyway).

So 2014…

 

Well, I’m now back at the gym after a long layoff due to an unrelated injury and I am planning on getting in good shape – more on this later.

By the end of the year I’d like to have a good amount of my house deposit saved.

I plan on having a massive de-junk of my life; letting go of the two dozen domains and half-built sites I have is one part of that.

Cut down on stimulants such as caffeine & get better sleep.

Keep on improving my Python & development knowledge – maybe move towards launching a web app.

So that’s it. I do have some more personal goals that I’m not sharing. I only had one goal last year, which was to learn Python, and although that is a gradual process that will never end, I do feel like I succeeded. It will be interesting to look back next year and see what I did or didn’t succeed at.

image manipulation with python

So, having recently got married, my Mrs wanted all the wedding photos converting to greyscale. I originally planned to do this with pixlr.com one at a time, as I thought she only wanted a few doing – it turned out to be all of them, and there are around 600!

I took this as an opportunity to extend my Python knowledge and I ended up with this…
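(A minimal version of it, assuming the Pillow imaging library and .jpg files.)

import glob
from PIL import Image

# convert every jpg in the current directory to greyscale
for filename in glob.glob('*.jpg'):
    image = Image.open(filename)
    grey = image.convert('L')          # 'L' mode is 8-bit greyscale
    grey.save('grey_' + filename)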

This runs through all the images in the directory the script is run in and makes a greyscale version of each, with the file name grey_original_file_name.jpg.

 

keeping your desktop tidy – like a boss!

I’m still on my path to Pythonic enlightenment, and today on my break I wrote a little script that I’m now going to be using to keep my desktop tidy and my files well organised. Basically, I’m terrible for saving my files to my desktop; not only is this messy and unorganised, it’s dangerous because if my computer dies, my work might die with it – so my files should be stored on the company’s backed-up common drive.

When I save files I always start them with the client’s name followed by a brief description, for example client_name_link_analysis.csv. Because of this, I was able to let a script organise things for me.

Here it is…
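(What follows is a cut-down sketch rather than my exact working copy; the folder paths, client names and the desktop_tidy.py filename are placeholders you’d swap for your own.)

import os
import glob
import shutil

# optional: point the script at the desktop if it isn't already running from there
os.chdir('C:/Users/your_name/Desktop')

# destination folders: paste in your own paths
client_folder1 = 'Z:/common/client_one'
client_folder2 = 'Z:/common/client_two'
other = 'Z:/common/other'

for x in glob.glob('*.*'):             # every file on the desktop, any extension
    name = str(x).lower()              # so CLIENT_report.csv and client_report.csv both match
    if 'client_one' in name:
        print('moving', x)
        shutil.move(x, client_folder1)
    elif 'client_two' in name:
        print('moving', x)
        shutil.move(x, client_folder2)
    elif 'desktop_tidy' in name:       # ignore this script itself
        pass
    else:
        shutil.move(x, other)          # anything unrecognised goes to the odds-and-ends folder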

Note!! There is a newer version at the end of this post!

So let’s break it down a little bit!

The imports at the top of the script are what we need to run some of the functions; they are all part of the standard library, so there are no special installs here. Just install Python, create a new file from IDLE, make the necessary changes and double-click it to run (just like you open any other program).

The next bit, the os.chdir() call, you don’t actually need; it depends how you are going to use the script. It changes the current working directory to the desktop. I say you probably don’t need it because you’ll most likely have this script sat on your desktop ready to run; it’s only needed if you want to clean the desktop while the program sits in another directory!

OK, so the next lines are for your destination folder locations, wherever they may be; just paste in the folder paths like I have and assign new variables too if you’d like. Note: if you change the variable names, i.e. "client_folder1", to your actual client’s name, then make sure you change them everywhere else in the program too.

The next line uses the glob function, glob.glob("*.*"). The two * are wildcards, so if you only wanted to find .txt files you would use glob.glob("*.txt"). The first asterisk is the file name and the second is the file extension; we’re using two wildcards because we want all files to be checked and moved. This line also starts a for loop, which is used to iterate over all the files and run a series of changes and checks against them…

The next line is part of the above for loop; it turns the file name into a string with str() (aka text) and makes it lowercase with .lower(). This is so it matches no matter how the file name is written, i.e. client_link_analysis.csv vs. CLIENT_link_analysis.csv.

The next section of code continues inside the loop; it says: if the name of the client is within the file name, do this… it prints it to the screen and then moves it to the destination folder (the print is really just for testing, and so you know things have moved). shutil.move(x, client_folder1) is the function that does the moving: x is the file and client_folder1 is the variable assigned the destination folder path, as above.

This then carries on for the different conditions; if you want to add or remove conditions, use the elif section to do so!

The last two blocks of code do the following: the final elif tells the program to ignore the script itself, otherwise it would be shot off to the ‘other’ folder, which I’ll get to in just a second. If you come up with a more creative name for the script file, don’t forget to change it here.

The else statement says: move everything else that doesn’t match these conditions to the ‘other’ folder. This is just odds and ends really, but obviously adjust it to suit your needs.

So that’s it. Obviously it needs changing to suit what you need it for; my actual working script looks quite different to this, but these are the basic workings of it.

Any questions let me know – oh and share this if you like it!!!

UPDATE.
I have rewritten it to be more maintainable, thanks to advice from Erwin de Keijzer
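(A dictionary-driven sketch of that rewrite; the client names and folder paths are placeholders again.)

import glob
import shutil

# client name fragment -> destination folder
clients = {
    'client_one': 'Z:/common/client_one',
    'client_two': 'Z:/common/client_two',
}
other = 'Z:/common/other'

for x in glob.glob('*.*'):
    name = str(x).lower()
    if 'desktop_tidy' in name:          # skip the script itself
        continue
    for client, folder in clients.items():
        if client in name:
            shutil.move(x, folder)
            break
    else:
        shutil.move(x, other)           # no client matched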

Now, to add new clients into the mix, I just update the dictionary with the client name and folder location and nothing else; this makes it cleaner and easier to update and maintain.

percentage calculator in python

At least once per week I find myself needing to work out some kind of percentage, and rather than firing up Excel or using a calculator (or Google) I tend to go to percentagecalculator.net. Obviously Excel is used for big jobs, but I’m just talking about one-offs here.

Anyway I thought I’d have a crack at a desktop version for myself to make it a little faster and to actually make something useful out of Python rather than a game.

Here’s the script…
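(A guess at roughly what it does: two inputs and the two percentage sums I reach for most often.)

x = float(input('Enter the first number: '))
y = float(input('Enter the second number: '))

# x as a percentage of y, and x percent of y
print(x, 'is', round(x / y * 100, 2), 'percent of', y)
print(x, 'percent of', y, 'is', x / 100 * y)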

 

rock, paper, scissors, lizard, spock #python

I finished creating my rock, paper, scissors, lizard, Spock game tonight (probably only an hour or two of actual coding but a few days of thinking were needed lol). Here are the rules…

Scissors cut paper
Paper covers rock
Rock crushes lizard
Lizard poisons Spock
Spock smashes scissors
Scissors decapitate lizard
Lizard eats paper
Paper disproves Spock
Spock vaporises rock
Rock crushes scissors

Basically, you pick a number that refers to one of the gestures above, the computer randomly selects one too, and a winner is determined. I still need to get a grip on error handling, as what I am doing so far is a little flaky – it works fine regardless!

The following code contains bits that aren’t needed or are just commented out, and other bits may seem a little backwards. The reason is that I originally made this a certain way to adhere to a project’s specific requirements on a course I’m taking, then altered it to allow for user input, just to see if I could, and I thought I’d leave it as is in case it’s useful.

Here’s my code (I’ve left the keys I originally used to work out the maths and I’ve made everything fairly well commented, so if you’re a n00b like moi you’ll be able to figure it out).
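(The version below is a condensed sketch rather than the course version described above: it swaps the number keys for a plain dictionary of what beats what, and keeps the shaky error handling, so a non-numeric input will still crash it.)

import random

# which gestures each gesture defeats, and how
beats = {
    'rock':     {'lizard': 'crushes', 'scissors': 'crushes'},
    'paper':    {'rock': 'covers', 'spock': 'disproves'},
    'scissors': {'paper': 'cut', 'lizard': 'decapitate'},
    'lizard':   {'spock': 'poisons', 'paper': 'eats'},
    'spock':    {'scissors': 'smashes', 'rock': 'vaporises'},
}
gestures = list(beats)

print('0: rock, 1: paper, 2: scissors, 3: lizard, 4: spock')
choice = int(input('Pick a number: '))          # no real error handling here, as noted above
player = gestures[choice]
computer = random.choice(gestures)

print('You chose', player, 'and the computer chose', computer)

if player == computer:
    print('It is a tie!')
elif computer in beats[player]:
    print(player, beats[player][computer], computer, '- you win!')
else:
    print(computer, beats[computer][player], player, '- you lose!')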

 

skype you sneaky #$@&%*! (you too chrome!)

Long story short, a Skype extension has been installed on my Chrome browser without my permission. I’m unable to uninstall the bloody thing and it says it has full access to my computer and data on websites I visit!

So what’s going on? Google published a post about not allowing extensions to be silently installed which you can see here http://blog.chromium.org/2012/12/no-more-silent-extension-installs.html Even our good friend Matt Cutts is commenting on the good news.

If you also have this annoying install, I have figured out how to at least disable it. If you go to the extensions page, chrome://extensions/, you will see you are unable to get rid of it; instead go to chrome://plugins/ and you can disable it from there, then right-click the Skype icon in the top right of your browser and select ‘hide’.

It’s not completely getting rid of it but it is disabling it.

So there you have it, Skype doing a sneaky install and Chrome allowing it. Shame on you both. (Google privacy? Ha! Yeah right.)