Shallow Thoughts : : programming

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sat, 05 Aug 2017

Keeping Git Branches in Sync

I do most of my coding on my home machine. But when I travel (or sit in boring meetings), sometimes I do a little hacking on my laptop. Most of my code is hosted in GitHub repos, so when I travel, I like to update all the repos on the laptop to make sure I have what I need even when I'm offline.

That works great as long as I don't make branches. I have a variable $myrepos that lists all the github repositories where I want to contribute, and with a little shell alias it's easy enough to update them all:

allgit() {
    pushd ~
    foreach repo ($myrepos)
        echo $repo :
        cd ~/src/$repo
        git pull
    end
    popd
}

That works well enough -- as long as you don't use branches.

Git's branch model seems to be that branches are for local development, and aren't meant to be shared, pushed, or synchronized among machines. It's ridiculously difficult in git to do something like, "for all branches on the remote server, make sure I have that branch and it's in sync with the server." When you create branches, they don't push to the server by default, and it's remarkably difficult to figure out which of your branches is actually tracking a branch on the server.

A web search finds plenty of people asking, and most of the Git experts answering say things like "Just check out the branch, then pull." In other words, if you want to work on a branch, you'd better know before you go offline exactly which branches in which repositories might have been created or updated since the last time you worked in that repository on that machine. I guess that works if you only ever work on one project in one repo and only on one or two branches at a time. It certainly doesn't work if you need to update lots of repos on a laptop for the first time in two weeks.

Further web searching does find a few possibilities. For checking whether there are files modified that need to be committed, git status --porcelain -uno works well. For checking whether changes are committed but not pushed, git for-each-ref --format="%(refname:short) %(push:track)" refs/heads | fgrep '[ahead' works ... if you make an alias so you never have to look at it.

Figuring out whether branches are tracking remotes is a lot harder. I found some recommendations like git branch -r | grep -v '\->' | while read remote; do git branch --track "${remote#origin/}" "$remote"; done and for remote in `git branch -r`; do git branch --track ${remote#origin/} $remote; done but neither of them really did what I wanted. I was chasing down the rabbit hole of writing shell loops using variables like

  localbranches=("${(@f)$(git branch | sed 's/..//')}")
  remotebranches=("${(@f)$(git branch -a | grep remotes | grep -v HEAD | grep -v master | sed 's_remotes/origin/__' | sed 's/..//')}")
when I thought, there must be a better way. Maybe using Python bindings?

git-python

In Debian, the available packages for Git Python bindings are python-git, python-pygit2, and python-dulwich. Nobody on #python seemed to like any of them, but based on quick attempts with all three, python-git seemed the most straightforward. Confusingly, though Debian calls it python-git, it's called "git-python" in its docs or in web searches, and it's "import git" when you use it.

It's pretty straightforward to use, at least for simple things. You can create a Repo object with

from git import Repo
repo = Repo('.')
and then you can get lists like repo.heads (local branches), repo.refs (local and remote branches and other refs such as tags), etc. Once you have a ref, you can use ref.name, check whether it's tracking a remote branch with ref.tracking_branch(), and make it track one with ref.set_tracking_branch(remoteref). That makes it very easy to get a list of branches showing which ones are tracking a remote branch, something that had proved almost impossible with the git command line.

Nice. But now I wanted more: I wanted to replace those baroque git status --porcelain and git for-each-ref commands I had been using to check whether my repos needed committing or pushing. That proved harder.

Checking for uncommitted files, I decided it would be easiest stick with the existing git status --porcelain -uno. Which was sort of true. git-python lets you call git commands, for cases where the Python bindings aren't quite up to snuff yet, but it doesn't handle all cases. I could call:

    output = repo.git.status(porcelain=True)
but I never did find a way to pass the -uno; I tried u=False, u=None, and u="no" but none of them worked. But -uno actually isn't that important so I decided to do without it.

I found out later that there's another way to call the git command, using execute, which lets you pass the exact arguments you'd pass on the command line. It didn't work to call for-each-ref the way I'd called repo.git.status (repo.git.for_each_ref isn't defined), but I could call it this way:

    foreachref = repo.git.execute(['git', 'for-each-ref',
                                   '--format="%(refname:short) %(push:track)"',
                                   'refs/heads'])
and then parse the output looking for "[ahead]". That worked, but ... ick. I wanted to figure out how to do that using Python.

It's easy to get a ref (branch) and its corresponding tracking ref (remote branch). ref.log() gives you a list of commits on each of the two branches, ordered from earliest to most recent, the opposite of git log. In the simple case, then, what I needed was to iterate backward over the two commit logs, looking for the most recent SHA that's common to both. The Python builtin reversed was useful here:

    for i, entry in enumerate(reversed(ref.log())):
        for j, upstream_entry in enumerate(reversed(upstream.log())):
            if entry.newhexsha == upstream_entry.newhexsha:
                return i, j

(i, j) are the number of commits on the local branch that the remote hasn't seen, and vice versa. If i is zero, or if there's nothing in ref.log(), then the repo has no new commits and doesn't need pushing.

Making branches track a remote

The last thing I needed to do was to make branches track their remotes. Too many times, I've found myself on the laptop, ready to work, and discovered that I didn't have the latest code because I'd been working on a branch on my home machine, and my git pull hadn't pulled the info for the branch because that branch wasn't in the laptop's repo yet. That's what got me started on this whole "update everything" script in the first place.

If you have a ref for the local branch and a ref for the remote branch, you can verify their ref.name is the same, and if the local branch has the same name but isn't tracking the remote branch, probably something went wrong with the local repo (like one of my earlier attempts to get branches in sync, and it's an easy fix: ref.set_tracking_branch(remoteref).

But what if the local branch doesn't exist yet? That's the situation I cared about most, when I've been working on a new branch and it's not on the laptop yet, but I'm going to want to work on it while traveling. And that turned out to be difficult, maybe impossible, to do in git-python.

It's easy to create a new local branch: repo.head.create(repo, name). But that branch gets created as a copy of master, and if you try to turn it into a copy of the remote branch, you get conflicts because the branch is ahead of the remote branch you're trying to copy, or vice versa. You really need to create the new branch as a copy of the remote branch it's supposed to be tracking.

If you search the git-python documentation for ref.create, there are references to "For more documentation, please see the Head.create method." Head.create takes a reference argument (the basic ref.create doesn't, though the documentation suggests it should). But how can you call Head.create? I had no luck with attempts like repo.git.Head.create(repo, name, reference=remotebranches[name]).

I finally gave up and went back to calling the command line from git-python.

repo.git.checkout(remotebranchname, b=name)
I'm not entirely happy with that, but it seems to work.

I'm sure there are all sorts of problems left to solve. But this script does a much better job than any git command I've found of listing the branches in my repositories, checking for modifications that require commits or pushes, and making local branches to mirror new branches on the server. And maybe with time the git-python bindings will improve, and eventually I'll be able to create new tracking branches locally without needing the command line.

The final script, such as it is: gitbranchsync.py.

Tags: , ,
[ 14:39 Aug 05, 2017    More programming | permalink to this entry | comments ]

Tue, 23 May 2017

Python help from the shell -- greppable and saveable

I'm working on a project involving PyQt5 (on which, more later). One of the problems is that there's not much online documentation, and it's hard to find out details like what signals (events) each widget offers.

Like most Python packages, there is inline help in the source, which means that in the Python console you can say something like

>>> from PyQt5.QtWebEngineWidgets import QWebEngineView
>>> help(QWebEngineView)
The problem is that it's ordered alphabetically; if you want a list of signals, you need to read through all the objects and methods the class offers to look for a few one-liners that include "unbound PYQT_SIGNAL".

If only there was a way to take help(CLASSNAME) and pipe it through grep!

A web search revealed that plenty of other people have wished for this, but I didn't see any solutions. But when I tried running python -c "help(list)" it worked fine -- help isn't dependent on the console.

That means that you should be able to do something like

python -c "from sys import exit; help(exit)"

Sure enough, that worked too.

From there it was only a matter of setting up a zsh function to save on complicated typing. I set up separate aliases for python2, python3 and whatever the default python is. You can get help on builtins (pythonhelp list) or on objects in modules (pythonhelp sys.exit). The zsh suffixes :r (remove extension) and :e (extension) came in handy for separating the module name, before the last dot, and the class name, after the dot.

#############################################################
# Python help functions. Get help on a Python class in a
# format that can be piped through grep, redirected to a file, etc.
# Usage: pythonhelp [module.]class [module.]class ...
pythonXhelp() {
    python=$1
    shift
    for f in $*; do
        if [[ $f =~ '.*\..*' ]]; then
            module=$f:r
            obj=$f:e
            s="from ${module} import ${obj}; help($obj)"
        else
            module=''
            obj=$f
            s="help($obj)"
        fi
        $python -c $s
    done
}
alias pythonhelp="pythonXhelp python"
alias python2help="pythonXhelp python2"
alias python3help="pythonXhelp python3"

So now I can type

python3help PyQt5.QtWebEngineWidgets.QWebEngineView | grep PYQT_SIGNAL
and get that list of signals I wanted.

Tags: , ,
[ 14:12 May 23, 2017    More programming | permalink to this entry | comments ]

Thu, 06 Apr 2017

Clicking through a translucent window: using X11 input shapes

It happened again: someone sent me a JPEG file with an image of a topo map, with a hiking trail and interesting stopping points drawn on it. Better than nothing. But what I really want on a hike is GPX waypoints that I can load into OsmAnd, so I can see whether I'm still on the trail and how to get to each point from where I am now.

My PyTopo program lets you view the coordinates of any point, so you can make a waypoint from that. But for adding lots of waypoints, that's too much work, so I added an "Add Waypoint" context menu item -- that was easy, took maybe twenty minutes. PyTopo already had the ability to save its existing tracks and waypoints as a GPX file, so no problem there.

[transparent image viewer overlayed on top of topo map] But how do you locate the waypoints you want? You can do it the hard way: show the JPEG in one window, PyTopo in the other, and do the "let's see the road bends left then right, and the point is off to the northwest just above the right bend and about two and a half times as far away as the distance through both road bends". Ugh. It takes forever and it's terribly inaccurate.

More than once, I've wished for a way to put up a translucent image overlay that would let me click through it. So I could see the image, line it up with the map in PyTopo (resizing as needed), then click exactly where I wanted waypoints.

I needed two features beyond what normal image viewers offer: translucency, and the ability to pass mouse clicks through to the window underneath.

A translucent image viewer, in Python

The first part, translucency, turned out to be trivial. In a class inheriting from my Python ImageViewerWindow, I just needed to add this line to the constructor:

    self.set_opacity(.5)

Plus one more step. The window was translucent now, but it didn't look translucent, because I'm running a simple window manager (Openbox) that doesn't have a compositor built in. Turns out you can run a compositor on top of Openbox. There are lots of compositors; the first one I found, which worked fine, was xcompmgr -c -t-6 -l-6 -o.1

The -c specifies client-side compositing. -t and -l specify top and left offsets for window shadows (negative so they go on the bottom right). -o.1 sets the opacity of window shadows. In the long run, -o0 is probably best (no shadows at all) since the shadow interferes a bit with seeing the window under the translucent one. But having a subtle .1 shadow was useful while I was debugging.

That's all I needed: voilà, translucent windows. Now on to the (much) harder part.

A click-through window, in C

X11 has something called the SHAPE extension, which I experimented with once before to make a silly program called moonroot. It's also used for the familiar "xeyes" program. It's used to make windows that aren't square, by passing a shape mask telling X what shape you want your window to be. In theory, I knew I could do something like make a mask where every other pixel was transparent, which would simulate a translucent image, and I'd at least be able to pass clicks through on half the pixels.

But fortunately, first I asked the estimable Openbox guru Mikael Magnusson, who tipped me off that the SHAPE extension also allows for an "input shape" that does exactly what I wanted: lets you catch events on only part of the window and pass them through on the rest, regardless of which parts of the window are visible.

Knowing that was great. Making it work was another matter. Input shapes turn out to be something hardly anyone uses, and there's very little documentation.

In both C and Python, I struggled with drawing onto a pixmap and using it to set the input shape. Finally I realized that there's a call to set the input shape from an X region. It's much easier to build a region out of rectangles than to draw onto a pixmap.

I got a C demo working first. The essence of it was this:

    if (!XShapeQueryExtension(dpy, &shape_event_base, &shape_error_base)) {
        printf("No SHAPE extension\n");
        return;
    }

    /* Make a shaped window, a rectangle smaller than the total
     * size of the window. The rest will be transparent.
     */
    region = CreateRegion(outerBound, outerBound,
                          XWinSize-outerBound*2, YWinSize-outerBound*2);
    XShapeCombineRegion(dpy, win, ShapeBounding, 0, 0, region, ShapeSet);
    XDestroyRegion(region);

    /* Make a frame region.
     * So in the outer frame, we get input, but inside it, it passes through.
     */
    region = CreateFrameRegion(innerBound);
    XShapeCombineRegion(dpy, win, ShapeInput, 0, 0, region, ShapeSet);
    XDestroyRegion(region);

CreateRegion sets up rectangle boundaries, then creates a region from those boundaries:

Region CreateRegion(int x, int y, int w, int h) {
    Region region = XCreateRegion();
    XRectangle rectangle;
    rectangle.x = x;
    rectangle.y = y;
    rectangle.width = w;
    rectangle.height = h;
    XUnionRectWithRegion(&rectangle, region, region);

    return region;
}

CreateFrameRegion() is similar but a little longer. Rather than post it all here, I've created a GIST: transregion.c, demonstrating X11 shaped input.

Next problem: once I had shaped input working, I could no longer move or resize the window, because the window manager passed events through the window's titlebar and decorations as well as through the rest of the window. That's why you'll see that CreateFrameRegion call in the gist: -- I had a theory that if I omitted the outer part of the window from the input shape, and handled input normally around the outside, maybe that would extend to the window manager decorations. But the problem turned out to be a minor Openbox bug, which Mikael quickly tracked down (in openbox/frame.c, in the XShapeCombineRectangles call on line 321, change ShapeBounding to kind). Openbox developers are the greatest!

Input Shapes in Python

Okay, now I had a proof of concept: X input shapes definitely can work, at least in C. How about in Python?

There's a set of python-xlib bindings, and they even supports the SHAPE extension, but they have no documentation and didn't seem to include input shapes. I filed a GitHub issue and traded a few notes with the maintainer of the project. It turned out the newest version of python-xlib had been completely rewritten, and supposedly does support input shapes. But the API is completely different from the C API, and after wasting about half a day tweaking the demo program trying to reverse engineer it, I gave up.

Fortunately, it turns out there's a much easier way. Python-gtk has shape support, even including input shapes. And if you use regions instead of pixmaps, it's this simple:

    if self.is_composited():
        region = gtk.gdk.region_rectangle(gtk.gdk.Rectangle(0, 0, 1, 1))
        self.window.input_shape_combine_region(region, 0, 0)

My transimageviewer.py came out nice and simple, inheriting from imageviewer.py and adding only translucency and the input shape.

If you want to define an input shape based on pixmaps instead of regions, it's a bit harder and you need to use the Cairo drawing API. I never got as far as working code, but I believe it should go something like this:

    # Warning: untested code!
    bitmap = gtk.gdk.Pixmap(None, self.width, self.height, 1)
    cr = bitmap.cairo_create()
    # Draw a white circle in a black rect:
    cr.rectangle(0, 0, self.width, self.height)
    cr.set_operator(cairo.OPERATOR_CLEAR)
    cr.fill();

    # draw white filled circle
    cr.arc(self.width / 2, self.height / 2, self.width / 4,
           0, 2 * math.pi);
    cr.set_operator(cairo.OPERATOR_OVER);
    cr.fill();

    self.window.input_shape_combine_mask(bitmap, 0, 0)

The translucent image viewer worked just as I'd hoped. I was able to take a JPG of a trailmap, overlay it on top of a PyTopo window, scale the JPG using the normal Openbox window manager handles, then right-click on top of trail markers to set waypoints. When I was done, a "Save as GPX" in PyTopo and I had a file ready to take with me on my phone.

Tags: , , ,
[ 17:08 Apr 06, 2017    More programming | permalink to this entry | comments ]

Sat, 25 Mar 2017

Reading keypresses in Python

As part of preparation for Everyone Does IT, I was working on a silly hack to my Python script that plays notes and chords: I wanted to use the computer keyboard like a music keyboard, and play different notes when I press different keys. Obviously, in a case like that I don't want line buffering -- I want the program to play notes as soon as I press a key, not wait until I hit Enter and then play the whole line at once. In Unix that's called "cbreak mode".

There are a few ways to do this in Python. The most straightforward way is to use the curses library, which is designed for console based user interfaces and games. But importing curses is overkill just to do key reading.

Years ago, I found a guide on the official Python Library and Extension FAQ: Python: How do I get a single keypress at a time?. I'd even used it once, for a one-off Raspberry Pi project that I didn't end up using much. I hadn't done much testing of it at the time, but trying it now, I found a big problem: it doesn't block.

Blocking is whether the read() waits for input or returns immediately. If I read a character with c = sys.stdin.read(1) but there's been no character typed yet, a non-blocking read will throw an IOError exception, while a blocking read will wait, not returning until the user types a character.

In the code on that Python FAQ page, blocking looks like it should be optional. This line:

fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)
is the part that requests non-blocking reads. Skipping that should let me read characters one at a time, block until each character is typed. But in practice, it doesn't work. If I omit the O_NONBLOCK flag, reads never return, not even if I hit Enter; if I set O_NONBLOCK, the read immediately raises an IOError. So I have to call read() over and over, spinning the CPU at 100% while I wait for the user to type something.

The way this is supposed to work is documented in the termios man page. Part of what tcgetattr returns is something called the cc structure, which includes two members called Vmin and Vtime. man termios is very clear on how they're supposed to work: for blocking, single character reads, you set Vmin to 1 (that's the number of characters you want it to batch up before returning), and Vtime to 0 (return immediately after getting that one character). But setting them in Python with tcsetattr doesn't make any difference.

(Python also has a module called tty that's supposed to simplify this stuff, and you should be able to call tty.setcbreak(fd). But that didn't work any better than termios: I suspect it just calls termios under the hood.)

But after a few hours of fiddling and googling, I realized that even if Python's termios can't block, there are other ways of blocking on input. The select system call lets you wait on any file descriptor until has input. So I should be able to set stdin to be non-blocking, then do my own blocking by waiting for it with select.

And that worked. Here's a minimal example:

import sys, os
import termios, fcntl
import select

fd = sys.stdin.fileno()
newattr = termios.tcgetattr(fd)
newattr[3] = newattr[3] & ~termios.ICANON
newattr[3] = newattr[3] & ~termios.ECHO
termios.tcsetattr(fd, termios.TCSANOW, newattr)

oldterm = termios.tcgetattr(fd)
oldflags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)

print "Type some stuff"
while True:
    inp, outp, err = select.select([sys.stdin], [], [])
    c = sys.stdin.read()
    if c == 'q':
        break
    print "-", c

# Reset the terminal:
termios.tcsetattr(fd, termios.TCSAFLUSH, oldterm)
fcntl.fcntl(fd, fcntl.F_SETFL, oldflags)

A less minimal example: keyreader.py, a class to read characters, with blocking and echo optional. It also cleans up after itself on exit, though most of the time that seems to happen automatically when I exit the Python script.

Update, 2017: It turns out this doesn't work in Python 3: 3 needs some extra semantics when opening the file. For a nice example of nonblocking read in Python 3, see ballingt: Nonblocking stdin read works differently in Python 3.

Tags: ,
[ 12:42 Mar 25, 2017    More programming | permalink to this entry | comments ]

Fri, 24 Feb 2017

Coder Dojo: Kids Teaching Themselves Programming

We have a terrific new program going on at Los Alamos Makers: a weekly Coder Dojo for kids, 6-7 on Tuesday nights.

Coder Dojo is a worldwide movement, and our local dojo is based on their ideas. Kids work on programming projects to earn colored USB wristbelts, with the requirements for belts getting progressively harder. Volunteer mentors are on hand to help, but we're not lecturing or teaching, just coaching.

Despite not much advertising, word has gotten around and we typically have 5-7 kids on Dojo nights, enough that all the makerspace's Raspberry Pi workstations are filled and we sometimes have to scrounge for more machines for the kids who don't bring their own laptops.

A fun moment early on came when we had a mentor meeting, and Neil, our head organizer (who deserves most of the credit for making this program work so well), looked around and said "One thing that might be good at some point is to get more men involved." Sure enough -- he was the only man in the room! For whatever reason, most of the programmers who have gotten involved have been women. A refreshing change from the usual programming group. (Come to think of it, the PEEC web development team is three women. A girl could get a skewed idea of gender demographics, living here.) The kids who come to program are about 40% girls.

I wondered at the beginning how it would work, with no lectures or formal programs. Would the kids just sit passively, waiting to be spoon fed? How would they get concepts like loops and conditionals and functions without someone actively teaching them?

It wasn't a problem. A few kids have some prior programming practice, and they help the others. Kids as young as 9 with no previous programming experience walk it, sit down at a Raspberry Pi station, and after five minutes of being shown how to bring up a Python console and use Python's turtle graphics module to draw a line and turn a corner, they're happily typing away, experimenting and making Python draw great colorful shapes.

Python-turtle turns out to be a wonderful way for beginners to learn. It's easy to get started, it makes pretty pictures, and yet, since it's Python, it's not just training wheels: kids are using a real programming language from the start, and they can search the web and find lots of helpful examples when they're trying to figure out how to do something new (just like professional programmers do. :-)

Initially we set easy requirements for the first (white) belt: attend for three weeks, learn the names of other Dojo members. We didn't require any actual programming until the second (yellow) belt, which required writing a program with two of three elements: a conditional, a loop, a function.

That plan went out the window at the end of the first evening, when two kids had already fulfilled the yellow belt requirements ... even though they were still two weeks away from the attendance requirement for the white belt. One of them had never programmed before. We've since scrapped the attendance belt, and now the white belt has the conditional/loop/function requirement that used to be the yellow belt.

The program has been going for a bit over three months now. We've awarded lots of white belts and a handful of yellows (three new ones just this week). Although most of the kids are working in Python, there are also several playing music or running LED strips using Arduino/C++, writing games and web pages in Javascript, writing adventure games Scratch, or just working through Khan Academy lectures.

When someone is ready for a belt, they present their program to everyone in the room and people ask questions about it: what does that line do? Which part of the program does that? How did you figure out that part? Then the mentors review the code over the next week, and they get the belt the following week.

For all but the first belt, helping newer members is a requirement, though I suspect even without that they'd be helping each other. Sit a first-timer next to someone who's typing away at a Python program and watch the magic happen. Sometimes it feels almost superfluous being a mentor. We chat with the kids and each other, work on our own projects, shoulder-surf, and wait for someone to ask for help with harder problems.

Overall, a terrific program, and our only problems now are getting funding for more belts and more workstations as the word spreads and our Dojo nights get more crowded. I've had several adults ask me if there was a comparable program for adults. Maybe some day (I hope).

Tags: ,
[ 13:46 Feb 24, 2017    More programming | permalink to this entry | comments ]

Mon, 23 Jan 2017

Testing a GitHub Pull Request

Several times recently I've come across someone with a useful fix to a program on GitHub, for which they'd filed a GitHub pull request.

The problem is that GitHub doesn't give you any link on the pull request to let you download the code in that pull request. You can get a list of the checkins inside it, or a list of the changed files so you can view the differences graphically. But if you want the code on your own computer, so you can test it, or use your own editors and diff tools to inspect it, it's not obvious how. That this is a problem is easily seen with a web search for something like download github pull request -- there are huge numbers of people asking how, and most of the answers are vague unclear.

That's a shame, because it turns out it's easy to pull a pull request. You can fetch it directly with git into a new branch as long as you have the pull request ID. That's the ID shown on the GitHub pull request page:

[GitHub pull request screenshot]

Once you have the pull request ID, choose a new name for your branch, then fetch it:

git fetch origin pull/PULL-REQUEST_ID/head:NEW-BRANCH-NAME
git checkout NEW-BRANCH-NAME

Then you can view diffs with something like git difftool NEW-BRANCH-NAME..master

Easy! GitHub should give a hint of that on its pull request pages.

Fetching a Pull Request diff to apply it to another tree

But shortly after I learned how to apply a pull request, I had a related but different problem in another project. There was a pull request for an older repository, but the part it applied to had since been split off into a separate project. (It was an old pull request that had fallen through the cracks, and as a new developer on the project, I wanted to see if I could help test it in the new repository.)

You can't pull a pull request that's for a whole different repository. But what you can do is go to the pull request's page on GitHub. There are 3 tabs: Conversation, Commits, and Files changed. Click on Files changed to see the diffs visually.

That works if the changes are small and only affect a few files (which fortunately was the case this time). It's not so great if there are a lot of changes or a lot of files affected. I couldn't find any "Raw" or "download" button that would give me a diff I could actually apply. You can select all and then paste the diffs into a local file, but you have to do that separately for each file affected. It might be, if you have a lot of files, that the best solution is to check out the original repo, apply the pull request, generate a diff locally with git diff, then apply that diff to the new repo. Rather circuitous. But with any luck that situation won't arise very often.

Update: thanks very much to Houz for the solution! (In the comments, below.) Just append .diff or .patch to the pull request URL, e.g. https://github.com/OWNER/REPO/pull/REQUEST-ID.diff which you can view in a browser or fetch with wget or curl.

Tags: , ,
[ 14:34 Jan 23, 2017    More programming | permalink to this entry | comments ]

Thu, 19 Jan 2017

Plotting Shapes with Python Basemap wwithout Shapefiles

In my article on Plotting election (and other county-level) data with Python Basemap, I used ESRI shapefiles for both states and counties.

But one of the election data files I found, OpenDataSoft's USA 2016 Presidential Election by county had embedded county shapes, available either as CSV or as GeoJSON. (I used the CSV version, but inside the CSV the geo data are encoded as JSON so you'll need JSON decoding either way. But that's no problem.)

Just about all the documentation I found on coloring shapes in Basemap assumed that the shapes were defined as ESRI shapefiles. How do you draw shapes if you have latitude/longitude data in a more open format?

As it turns out, it's quite easy, but it took a fair amount of poking around inside Basemap to figure out how it worked.

In the loop over counties in the US in the previous article, the end goal was to create a matplotlib Polygon and use that to add a Basemap patch. But matplotlib's Polygon wants map coordinates, not latitude/longitude.

If m is your basemap (i.e. you created the map with m = Basemap( ... ), you can translate coordinates like this:

    (mapx, mapy) = m(longitude, latitude)

So once you have a region as a list of (longitude, latitude) coordinate pairs, you can create a colored, shaped patch like this:

    for coord_pair in region:
        coord_pair[0], coord_pair[1] = m(coord_pair[0], coord_pair[1])
    poly = Polygon(region, facecolor=color, edgecolor=color)
    ax.add_patch(poly)

Working with the OpenDataSoft data file was actually a little harder than that, because the list of coordinates was JSON-encoded inside the CSV file, so I had to decode it with json.loads(county["Geo Shape"]). Once decoded, it had some counties as a Polygonlist of lists (allowing for discontiguous outlines), and others as a MultiPolygonlist of list of lists (I'm not sure why, since the Polygon format already allows for discontiguous boundaries)

[Blue-red-purple 2016 election map]

And a few counties were missing, so there were blanks on the map, which show up as white patches in this screenshot. The counties missing data either have inconsistent formatting in their coordinate lists, or they have only one coordinate pair, and they include Washington, Virginia; Roane, Tennessee; Schley, Georgia; Terrell, Georgia; Marshall, Alabama; Williamsburg, Virginia; and Pike Georgia; plus Oglala Lakota (which is clearly meant to be Oglala, South Dakota), and all of Alaska.

One thing about crunching data files from the internet is that there are always a few special cases you have to code around. And I could have gotten those coordinates from the census shapefiles; but as long as I needed the census shapefile anyway, why use the CSV shapes at all? In this particular case, it makes more sense to use the shapefiles from the Census.

Still, I'm glad to have learned how to use arbitrary coordinates as shapes, freeing me from the proprietary and annoying ESRI shapefile format.

The code: Blue-red map using CSV with embedded county shapes

Tags: , , , , ,
[ 09:36 Jan 19, 2017    More programming | permalink to this entry | comments ]

Sat, 14 Jan 2017

Plotting election (and other county-level) data with Python Basemap

After my arduous search for open 2016 election data by county, as a first test I wanted one of those red-blue-purple charts of how Democratic or Republican each county's vote was.

I used the Basemap package for plotting. It used to be part of matplotlib, but it's been split off into its own toolkit, grouped under mpl_toolkits: on Debian, it's available as python-mpltoolkits.basemap, or you can find Basemap on GitHub.

It's easiest to start with the fillstates.py example that shows how to draw a US map with different states colored differently. You'll need the three shapefiles (because of ESRI's silly shapefile format): st99_d00.dbf, st99_d00.shp and st99_d00.shx, available in the same examples directory.

Of course, to plot counties, you need county shapefiles as well. The US Census has county shapefiles at several different resolutions (I used the 500k version). Then you can plot state and counties outlines like this:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

def draw_us_map():
    # Set the lower left and upper right limits of the bounding box:
    lllon = -119
    urlon = -64
    lllat = 22.0
    urlat = 50.5
    # and calculate a centerpoint, needed for the projection:
    centerlon = float(lllon + urlon) / 2.0
    centerlat = float(lllat + urlat) / 2.0

    m = Basemap(resolution='i',  # crude, low, intermediate, high, full
                llcrnrlon = lllon, urcrnrlon = urlon,
                lon_0 = centerlon,
                llcrnrlat = lllat, urcrnrlat = urlat,
                lat_0 = centerlat,
                projection='tmerc')

    # Read state boundaries.
    shp_info = m.readshapefile('st99_d00', 'states',
                               drawbounds=True, color='lightgrey')

    # Read county boundaries
    shp_info = m.readshapefile('cb_2015_us_county_500k',
                               'counties',
                               drawbounds=True)

if __name__ == "__main__":
    draw_us_map()
    plt.title('US Counties')
    # Get rid of some of the extraneous whitespace matplotlib loves to use.
    plt.tight_layout(pad=0, w_pad=0, h_pad=0)
    plt.show()
[Simple map of US county borders]

Accessing the state and county data after reading shapefiles

Great. Now that we've plotted all the states and counties, how do we get a list of them, so that when I read out "Santa Clara, CA" from the data I'm trying to plot, I know which map object to color?

After calling readshapefile('st99_d00', 'states'), m has two new members, both lists: m.states and m.states_info.

m.states_info[] is a list of dicts mirroring what was in the shapefile. For the Census state list, the useful keys are NAME, AREA, and PERIMETER. There's also STATE, which is an integer (not restricted to 1 through 50) but I'll get to that.

If you want the shape for, say, California, iterate through m.states_info[] looking for the one where m.states_info[i]["NAME"] == "California". Note i; the shape coordinates will be in m.states[i]n (in basemap map coordinates, not latitude/longitude).

Correlating states and counties in Census shapefiles

County data is similar, with county names in m.counties_info[i]["NAME"]. Remember that STATE integer? Each county has a STATEFP, m.counties_info[i]["STATEFP"] that matches some state's m.states_info[i]["STATE"].

But doing that search every time would be slow. So right after calling readshapefile for the states, I make a table of states. Empirically, STATE in the state list goes up to 72. Why 72? Shrug.

    MAXSTATEFP = 73
    states = [None] * MAXSTATEFP
    for state in m.states_info:
        statefp = int(state["STATE"])
        # Many states have multiple entries in m.states (because of islands).
        # Only add it once.
        if not states[statefp]:
            states[statefp] = state["NAME"]

That'll make it easy to look up a county's state name quickly when we're looping through all the counties.

Calculating colors for each county

Time to figure out the colors from the Deleetdk election results CSV file. Reading lines from the CSV file into a dictionary is superficially easy enough:

    fp = open("tidy_data.csv")
    reader = csv.DictReader(fp)

    # Make a dictionary of all "county, state" and their colors.
    county_colors = {}
    for county in reader:
        # What color is this county?
        pop = float(county["votes"])
        blue = float(county["results.clintonh"])/pop
        red = float(county["Total.Population"])/pop
        county_colors["%s, %s" % (county["name"], county["State"])] \
            = (red, 0, blue)

But in practice, that wasn't good enough, because the county names in the Deleetdk names didn't always match the official Census county names.

Fuzzy matches

For instance, the CSV file had no results for Alaska or Puerto Rico, so I had to skip those. Non-ASCII characters were a problem: "Doña Ana" county in the census data was "Dona Ana" in the CSV. I had to strip off " County", " Borough" and similar terms: "St Louis" in the census data was "St. Louis County" in the CSV. Some names were capitalized differently, like PLYMOUTH vs. Plymouth, or Lac Qui Parle vs. Lac qui Parle. And some names were just different, like "Jeff Davis" vs. "Jefferson Davis".

To get around that I used SequenceMatcher to look for fuzzy matches when I couldn't find an exact match:

def fuzzy_find(s, slist):
    '''Try to find a fuzzy match for s in slist.
    '''
    best_ratio = -1
    best_match = None

    ls = s.lower()
    for ss in slist:
        r = SequenceMatcher(None, ls, ss.lower()).ratio()
        if r > best_ratio:
            best_ratio = r
            best_match = ss
    if best_ratio > .75:
        return best_match
    return None

Correlate the county names from the two datasets

It's finally time to loop through the counties in the map to color and plot them.

Remember STATE vs. STATEFP? It turns out there are a few counties in the census county shapefile with a STATEFP that doesn't match any STATE in the state shapefile. Mostly they're in the Virgin Islands and I don't have election data for them anyway, so I skipped them for now. I also skipped Puerto Rico and Alaska (no results in the election data) and counties that had no corresponding state: I'll omit that code here, but you can see it in the final script, linked at the end.

    for i, county in enumerate(m.counties_info):
        countyname = county["NAME"]
        try:
            statename = states[int(county["STATEFP"])]
        except IndexError:
            print countyname, "has out-of-index statefp of", county["STATEFP"]
            continue

        countystate = "%s, %s" % (countyname, statename)
        try:
            ccolor = county_colors[countystate]
        except KeyError:
            # No exact match; try for a fuzzy match
            fuzzyname = fuzzy_find(countystate, county_colors.keys())
            if fuzzyname:
                ccolor = county_colors[fuzzyname]
                county_colors[countystate] = ccolor
            else:
                print "No match for", countystate
                continue

        countyseg = m.counties[i]
        poly = Polygon(countyseg, facecolor=ccolor)  # edgecolor="white"
        ax.add_patch(poly)

Moving Hawaii

Finally, although the CSV didn't have results for Alaska, it did have Hawaii. To display it, you can move it when creating the patches:

    countyseg = m.counties[i]
    if statename == 'Hawaii':
        countyseg = list(map(lambda (x,y): (x + 5750000, y-1400000), countyseg))
    poly = Polygon(countyseg, facecolor=countycolor)
    ax.add_patch(poly)
The offsets are in map coordinates and are empirical; I fiddled with them until Hawaii showed up at a reasonable place. [Blue-red-purple 2016 election map]

Well, that was a much longer article than I intended. Turns out it takes a fair amount of code to correlate several datasets and turn them into a map. But a lot of the work will be applicable to other datasets.

Full script on GitHub: Blue-red map using Census county shapefile

Tags: , , , , , ,
[ 15:10 Jan 14, 2017    More programming | permalink to this entry | comments ]