Jargon at work
Jul. 28th, 2005 01:06 am

I had a pleasantly geeky day at work, and I don't really expect anyone to care or even necessarily to follow, but I just felt like I needed to write it down. Maybe as an explanation for why when people ask what I did at work I usually respond: "Oh, y'know, stuff."
Currently, I'm working on cleaning up the swiki for our game so that I'm comfortable having strangers (that is to say, potential users) look at it when they want to find out more about the game and hopefully download it.
It's gotten kind of tangled, and I was having a hard time figuring out what was where and how I should move some things around. So I decided I needed to be able to look at the link structure, to see how it was organized.
I started by diagramming things on paper, and that lasted for all of about a minute before I decided it was too much work. I am a geek. I have a computer. It should do the work for me.
So first I grab the data, thinking I might get something out of just eyeballing it. A quick command-line loop to pull down all the pages with wget. Then I extract the titles into one directory and iterate on a few perl one-liners to parse out the content from the href tags into another to get the links. (It takes a few tries to discover all the things you want to exclude.)
(perl -ne 'if(m|href="/dd/(\d+)|g){print "$1\n";}' $i | sort | uniq > linx/$i, not that you actually care -- but I wrote it down as I went, because I might want to do it again someday.)
Hunh. Can't tell too much by looking -- just that there's a lot of leaf nodes and a few hubs (quelle surprise) so it won't be too bad to clean up. So now I look for visualizers on the web. Touchgraph will show me link structure relative to other websites, but not the internals, so that's no good. I find another graph visualizer that'll do it, though. Not a very good graph visualizer, mind you, but a free one designed for exactly this kind of thing.
Download it, unpack it, run it, dear lord it actually runs! Keen. Feed it the URL, but nope, it doesn't really like that. Well, what kind of datafile will it read? Poke around and find a reference webpage talking about XGMML files, which seem to be what it wants. Figure out the format. Easy enough -- throw around a few more one-line scripts to cut and paste together an XML file, feed it to the program, and voila. A graph.
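For reference, a minimal XGMML file looks roughly like this (the ids and labels here are invented, not the actual swiki pages):

```xml
<?xml version="1.0"?>
<graph label="swiki" directed="1" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="12" label="Home"/>
  <node id="34" label="Download"/>
  <edge source="12" target="34"/>
</graph>
```

One node element per page, one edge element per link -- which is why a pile of cut-and-paste one-liners over the per-page link files is enough to produce it.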
Okay, it's really NOT a very good visualizer. Like, it sucks a lot. But it shows me some useful stuff -- these pages cluster together, and those twenty things are all part of that blob, okay, enough to tell that this is actually useful information. But not enough, because the UI reeeeeally sucks and worse, it's crashy.
But I have this random little Processing app that I fiddled with a while back that'll animate a graph. I can adapt it pretty quick. So I boot up Processing -- oh, hey, there's a new version, install that, update the code to post-beta so it runs, twiddle it to work with the new data (which is a graph and not a tree), teach it how to read an XML file, and hey presto, the picture that I actually wanted.
And now, I can fiddle with it long enough to determine that yes, indeed, this stuff is over here, and that stuff is over there, and okay, that means that these things are all safely stuffed into that closet and not spilling out into the bedroom. So all I need to do is clean up this area and that area, tuck those things out of sight behind that set of links, update these bits, and it'll be, well, passable, at least.
So now I know where things are. I have about a page or so of notes on what to fix on the main page, I know what to update and what to archive, and I did a bunch of it before I went home for the day. Hooray! I got something accomplished!
no subject
Date: 2005-07-28 08:14 am (UTC)

Also: hooray for the For Science icon! For too long has it been missing: inspirational and minty fresh!
no subject
Date: 2005-07-28 08:41 am (UTC)

My geek story: yesterday I decided to script a bunch of tests together so that I could press a button, leave it to run on a server somewhere, and come back when they were all done. But then I discovered that I could also take the results (which by default are stored in a series of XML files) and log them to a database so I could query into a reporting service rather than using the test app's reporting interface (which I find clunky and barely useful).
However, the documentation was incorrect: there were a whole series of data points that were stored in a completely different object than what was documented, so on first run these data points were always zero. After trying to decide if this was a type conversion error (since these data points happened to be doubles whereas the rest were returned as strings) I scanned the XML and determined that the elements corresponded to the objects, and that I was looking for the data in the wrong place. Once I pointed my script at the correct object it worked perfectly.
HOWEVER (and the rest was just an advance to this question) I have to ask myself whether the day I spent figuring that out was really worth it, or whether I would have been better off figuring out how to use the interface better. I'm almost positive, given the number of runs these tests will have, that the time spent will ultimately be worth the control I now have over the data.
But do you ever question whether you're on the completely wrong track, and at what point you should just turn back and admit you should have taken the more manual approach -- that even though you've already wasted a day looking into it, if you don't suck it up now you'll end up wasting even more time? I find I want to do things the "right way," meaning the efficient, elegant way I have it designed in my head, rather than the clunky, brute-force way that takes a long time and offers no efficiency for multiple iterations, but is simple, takes little thought or research, and will almost certainly work the first time.
Sorry--feeling rambly this morning. But you know what I mean?
oo, cool.
Date: 2005-07-28 08:55 am (UTC)

I'm very curious what exactly you animated -- some sort of dynamic map of zooming in and out?
How standalone is this Processing app? (I'd love to play with it, if it isn't too grungy 'n' tied to other stuff. I've browsed the Processing site before, but left it in the category of stuff To Investigate Later. Well, this is close enough to cool to warrant being Later.)
And finally in the possibly-annoying-geeky-hindsight vein, was graphviz the free visualizer you grabbed? I've heard of mapping a site with 'apache2dot', which uses apache logs to output graphviz format, and running wget on a site to make the apache logs for processing.
Anyway, the last time I needed to do something like this, I made an ugly kludge of a script that didn't even make a graphical output, just a text tree...
no subject
Date: 2005-07-28 02:34 pm (UTC)

So, for example, m|href="/dd/(\d+)|g breaks down as follows:
"m//" is the match operator; it tells you whether the current string matches whatever's between the slashes. The "g" is the global modifier, so instead of finding just the first match, it keeps going and finds all the matches.
Now, sometimes you're looking for a pattern that will have lots of slashes in it, so instead of escaping those with backslashes (which gives you lots of '\/'s all over), you can change the delimiter characters on the match. So "m||" and "m.." mean the same thing as "m//", it's just more convenient to write it differently sometimes.
The chunk 'href="/dd/' is just the chunk that we're looking for. Inside a regular expression (or regexp, the thing we're looking to match), "\d" means "a digit", and + is a modifier meaning "one or more". So (\d+) means "at least one digit", and then the parentheses have the effect of marking the chunk that matches for later use.
So m|href="/dd/(\d+)|g means, "Look over the string and find me all the spots where we have the string 'href="/dd/' followed by some number, and set those numbers aside so I can do something with them."
Which is loquacious enough in English that I'm sure you can imagine how it might turn into a page or more of code in a language that didn't have all kinds of shorthand notation for the complicated bits.
no subject
Date: 2005-07-28 02:37 pm (UTC)

The O'Reilly books. Learning Perl (aka the llama book) and Programming Perl (aka the camel book).
They are totally the way to go. I might be able to loan you my copies; I'll send you email.
And just for you: another instance of my "A-ha!" icon. =)
no subject
Date: 2005-07-28 02:49 pm (UTC)

Sometimes the stupid low-tech way really is better -- like just grabbing a marker and a piece of scratch paper to make a temporary sign instead of mucking around in Word for half an hour.
I think with practice, you start to acquire some instincts for deciding when it's a blind alleyway and when you'll really save time making yourself a new tool to do it right. But yeah, sometimes you still make the wrong choice.
Re: oo, cool.
Date: 2005-07-28 03:49 pm (UTC)

Processing is TOTALLY stand-alone. It exports webpages with applets, even. I'll see if I can find someplace to stick the couple I was working with, so you can see what I was talking about.
I just poked around with graphviz, and it's pretty clever (although the documentation SUCKS!), but it wouldn't really produce useful output - the graph it made of my swiki structure was, like, 8000 pixels wide.