Monday, May 25, 2009
Tesseract does all the data harvesting and analysis we want to do, but presents the data in a complex, freestanding web app. We'd work on tweaking the analysis portion to work well with the data the Hadley Centre keeps (if necessary), getting it to run as an unattended part of the project management/repository back end, and pushing the congruence data it generates to extremely simple views within the project's Trac site.
Thursday, May 21, 2009
- I set up toy local Trac and subversion servers to look at what information's available out of the box. It turns out that Trac doesn't really track anything that could be useful for building a graph of straight-up social interactions. This suggests some things about how to set up the project: our repository authorship graph maker is a totally separate module from the social network graph maker; both export to a common network representation format; the recommendation engine combines them and spits out information; and the Trac plugin serves pretty views on that info. This is probably the best way to set it up regardless of the social network information source (especially if we want to be able to adapt it to different VCSs and viewers), but it's good to start thinking about more concrete choices.
- It's my understanding that at the Hadley Centre, they would likely be able to feed all work email history into the social graph maker, and that guided my description of how to create a social graph from yesterday. I'd really like to make a suite of tools that could potentially be useful to other projects, though, so it's worth thinking about what resources others might have available. Many open source projects use mailing lists to communicate, and it makes sense to base a social graph of mailing list participants on who has replied to whom. More on this as I consider it.
- How should we track LOC edited? I don't know whether Hadley uses BDB or FSFS for their subversion backend. FSFS introspection looks pretty straightforward: each revision has an author, each revision file has a list of deltas, followed by a list of information about the files revised. It'd probably be better to use existing parsers, even if all we want is linecount/filename.
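Rather than poking at the FSFS files directly, the stock subversion client gets us surprisingly far. As a rough sketch (the function name and input here are mine, not anything we've built): `svn log --xml --verbose` yields changed paths per author out of the box, and line counts would still need a pass over `svn diff` output per revision.

```python
import xml.etree.ElementTree as ET

def changed_paths_by_author(xml_text):
    """Parse `svn log --xml --verbose` output into {author: [paths]}.

    Line counts aren't in the log output, so a real LOC tally would
    still have to run `svn diff` per revision, but this shows how far
    the stock tools get us without touching the FSFS backend at all.
    Obtain xml_text with something like:
        svn log --xml --verbose <repo-url>
    """
    result = {}
    for entry in ET.fromstring(xml_text).iter("logentry"):
        author = entry.findtext("author", default="(unknown)")
        paths = [p.text for p in entry.iter("path")]
        result.setdefault(author, []).extend(paths)
    return result
```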
Tuesday, May 19, 2009
To build the graphs:
Create a social relationship graph.
Look at email to: and from: fields in the tracked communications and give each pair of people a relationship point for each time one emails the other. Use the relationship points to determine strength of connection in a relationship graph.
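A minimal sketch of that counting rule, assuming the mail has already been reduced to (sender, recipients) pairs (the function name and input shape are my own):

```python
from collections import defaultdict

def relationship_graph(emails):
    """Build an undirected weighted graph of who emails whom.

    `emails` is assumed to be an iterable of (sender, recipients)
    pairs pulled from the From: and To: headers of the tracked mail.
    """
    points = defaultdict(int)
    for sender, recipients in emails:
        for recipient in recipients:
            if recipient == sender:
                continue
            # Canonical ordering so A->B and B->A accumulate
            # on the same edge.
            edge = tuple(sorted((sender, recipient)))
            points[edge] += 1
    return dict(points)
```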
Create a code relatedness graph.
For each pair of code modules, give them a relatedness point for each time they've been checked in at the same time. This code relatedness thing could get much more complex, but I understand there's a lot of source visualization software out there that's already solved these problems, so we could look at them.
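The co-check-in rule could start as simple as this sketch, assuming each check-in has already been reduced to the list of modules it touched:

```python
from collections import defaultdict
from itertools import combinations

def relatedness_graph(commits):
    """Give each pair of modules a point per check-in touching both.

    `commits` is assumed to be an iterable of lists of module names,
    one list per check-in.
    """
    points = defaultdict(int)
    for modules in commits:
        # De-duplicate and sort so each unordered pair has one key.
        for a, b in combinations(sorted(set(modules)), 2):
            points[(a, b)] += 1
    return dict(points)
```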
Create a module-by-module expertise listing.
For each code module, look at the subversion history and record the number of lines of code each distinct author has added, changed, and deleted over the life of the module (LOC edited).
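A sketch of the tally, assuming the diffs have already been boiled down to per-revision (author, module, added, changed, deleted) tuples (that input shape is my assumption):

```python
from collections import defaultdict

def loc_edited(revisions):
    """Tally LOC edited per (module, author).

    `revisions` is an iterable of (author, module, added, changed,
    deleted) tuples; added + changed + deleted is "LOC edited".
    """
    totals = defaultdict(int)
    for author, module, added, changed, deleted in revisions:
        totals[(module, author)] += added + changed + deleted
    return dict(totals)
```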
Create a shared authorship graph. This one's still very rough.
- For each pair of people, for each code module, give them min(A's LOC edited, B's LOC edited) shared authorship points.
- For each pair of people, for each pair of related code modules, give them (min(A's LOC edited in both, B's LOC edited in both)*relatedness/something) shared authorship points.
- Rationale: two heavy editors should get a higher rating than one heavy editor and one light editor, hence the min() construction.
- Edits in related modules should count for less than edits in the same module, hence the "/something" divisor, probably to be determined by dumb tweaking until it lines up with the results of surveying the coders about their networks or something.
- Total shared authorship points between each pair of people is the strength of connection in the graph.
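Pulling the rough rules above together, here's one possible reading of them as code; the damping divisor stands in for the undetermined "/something", and treating each relatedness pair once (rather than in both directions) is my own interpretation of the still-vague rule two:

```python
from collections import defaultdict
from itertools import combinations

def shared_authorship(loc, relatedness, damping=4.0):
    """Combine per-module LOC tallies into shared-authorship weights.

    `loc` maps (module, author) -> LOC edited; `relatedness` maps
    (module_a, module_b) -> relatedness points.  `damping` stands in
    for the undetermined "/something" divisor.
    """
    authors_by_module = defaultdict(dict)
    for (module, author), n in loc.items():
        authors_by_module[module][author] = n

    points = defaultdict(float)
    # Rule 1: both edited the same module -> min of the two tallies.
    for authors in authors_by_module.values():
        for a, b in combinations(sorted(authors), 2):
            points[(a, b)] += min(authors[a], authors[b])
    # Rule 2: edits in related modules count for less, discounted
    # by relatedness/damping.  One reading of the rough rule.
    for (m1, m2), rel in relatedness.items():
        for a, n_a in authors_by_module.get(m1, {}).items():
            for b, n_b in authors_by_module.get(m2, {}).items():
                if a != b:
                    pair = tuple(sorted((a, b)))
                    points[pair] += min(n_a, n_b) * rel / damping
    return dict(points)
```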
The primary purpose would be to decide on a threshold difference between relationship points and shared authorship points at which we'd consider a pair of people not to be communicating effectively. If Alice and Bob have 2000 authorship points but only 500 relationship points, we would add them to each other's recommended collaborators feed, available as a widget down the side of the Trac project home page with a link to one another's emails or something.
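The threshold test itself would be trivial; the hard part is picking the number (everything here, threshold included, is a placeholder to be tuned against survey data):

```python
def recommended_collaborators(authorship, relationship, threshold=1000):
    """Flag pairs whose shared authorship outstrips their contact.

    Both arguments map (person_a, person_b) -> points.  Pairs where
    authorship exceeds relationship by more than `threshold` go into
    the recommended collaborators feed.
    """
    flagged = []
    for pair, auth in authorship.items():
        rel = relationship.get(pair, 0)
        if auth - rel > threshold:
            flagged.append(pair)
    return flagged
```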
- People can input the name of a module and get back a list of the experts on that module (determined by LOC edited), and maybe a list of related module expertise search links.
- To really reach, the above could be smarter, perhaps. If I'm writing an in-trac email or bug report that mentions modules by name, it could automatically suggest additional people to copy the ticket to.
- You could have a list of experts in modules you've recently checked in as a quick-contact box (with manual add and stickying people allowed).
- Managers can see a visualization of discrepancies between the social and shared authorship graphs to help diagnose organizational inefficiencies.
- When Bob shows up on Alice's collaborators feed, she can click "Who's Bob?" and see a graph of the social network with paths between her and Bob highlighted.
- Should expertise slowly expire? It could make sense for experience within the last year to count more than experience from several years back. This would mean counting expertise points as LOC edited as a function of time - not hard to do since we'll be getting our info from diffs anyways, but it stinks of unnecessary complexity.
- Should we allow for diff-by-diff updates of the graphs, or assume it'll just be fully rebuilt once a week or whatever? Probably the latter to start off, until we have an idea of just how big the organization is.
- Must make sure to keep in mind that we're doing all this fancy footwork in order to deliver a final product that's extremely simple so people might actually use it. Other social network graphing solutions exist, we need to focus on making ours simple and directed. The recommended collaborators feature fits, but not all of the others do.
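On the expiring-expertise question above: if we did want it, an exponential half-life keeps the complexity down to almost nothing (the one-year half-life is a guess to be tuned, and the input shape is mine):

```python
def decayed_expertise(edits, half_life_days=365.0):
    """Sum LOC edited with an exponential age discount.

    `edits` is an iterable of (loc, age_in_days) pairs; an edit one
    half-life old counts for half as much as one made today.
    """
    return sum(loc * 0.5 ** (age / half_life_days) for loc, age in edits)
```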
Note: Thanks to Ainsley for terminology correction, and please see her similar post for more information on these ideas.
Friday, May 15, 2009
Now, a pretty graph is not too useful, so what else could you do? Well, there's the original plan we had of generating a code authorship network as well, overlaying them, and identifying some discrepancies as inefficiencies in the project that can be fixed by introducing people to one another or even restructuring teams. That sounds hard.
Even harder would be some sort of semantic analysis - I have a word cloud culled from emails and tickets relating to each person, and when I submit a ticket or send an email, it suggests more people to add to the recipient field based on keywords I've just typed.
Hmm, so I guess where I'm at is that I can see how to set up the basic system, but I'm not sure whether I can get more than one dataset so we can have comparisons and recommendations, rather than just straight up visualization of what's already happening. So, off to research! I'll check out the free tools on Wikipedia's page of social network analysis software, and start a search for what's been done in the way of repository introspection.
Social Network Analysis from Project Management Data
This is what I'll be investigating for the next week. So far, I haven't found any closely similar projects, but the field itself is daunting. "Who should fix this bug," discussed in a previous post, was an ambitious tool for analyzing bug tracking information with a smaller results scope and a better sense of what would constitute success, and even they got only so-so results out of their project. Wikipedia's coverage of building social network graphs is making my head explode. It's a lot to take in, so I'll try to list issues to investigate here:
- Where can I get input data? Ideally, I'd grab the full backend database for an instantiation of a Trac variant supporting a real, somewhat long-lived and complex project. I'm told I should ask Greg Wilson and David Wolover about getting DrProject history.
- Once I have data and have processed it, what do I plan on doing with it? How would I test my results? The bug assignment team could compare predicted bugfix assignees to who actually closed the ticket; what's my metric? Would a comparison to some sort of aggregated graph of contact like from Google's social graphing results be fruitful? It's unlikely, since we're ranking social contact within a work environment, while most social networking data online is voluntary. Maybe some sort of survey set up for participants to rate their working relationships with one another? This seems like the best route, but I'd have to set it up ahead of time so as not to fall into the post-hoc analysis trap.
- What about the graph itself? Should connections be directionally weighted (I think that's the term)? That is, if everyone contacts the intern to assign her small tasks but she usually only contacts her direct supervisor, should we keep track of the distinction or just collapse it into "has contact with many people"? Should we count mentions of each other's names in communication? Changes in assigned-to status from A to B as a link between them? Actual emails? Should some links count more than others? By how much? What sort of crazy voodoo could possibly guide my choice there? I think one thing to do would be to construct different graphs for different contact types, with the ability to overlay/combine them later. Another possibility is to take a page from how these scientists run their models: gather survey results first, then run experiments on our program, changing weightings until it closely matches the survey results.
- What sort of out-of-the-box solutions are available to me for visualizing social networks? What about for graphs like this in general?
- Should I be planning on making something that's specifically suited to their team? Or a more general tool?
- Apparently people are working toward open standards for disambiguating social links. These are kind of cool, and could be useful for a variety of our projects here.
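The different-graphs-per-contact-type idea from the list above could be sketched as a simple weighted overlay, with the per-type multipliers left as exactly the open question they are (all names here are hypothetical):

```python
def overlay(graphs, weights):
    """Combine per-contact-type graphs into one weighted graph.

    `graphs` maps a contact type (e.g. "email", "assigned-to") to an
    edge dict of (a, b) -> count; `weights` holds the per-type
    multipliers, which are the open question - here they'd come from
    tuning against survey results.
    """
    combined = {}
    for kind, edges in graphs.items():
        w = weights.get(kind, 1.0)
        for pair, n in edges.items():
            combined[pair] = combined.get(pair, 0.0) + w * n
    return combined
```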
Tuesday, May 12, 2009
Because an attempt to create or support a large-scale crawler would be madness, I figure we'd use an existing search service to find new research papers based on users' queries. I'm not sure, however, what would be accessible to us.
We might qualify for access to google research, but it would tightly limit what we could do with the project at the end, possibly making our results useless unless they're adopted by a research paper search company. The google search API is probably largely useless, as results are limited to 64 entries and, moreover, the terms require that the search component not comprise the core of your app or webpage.
Scraping search results from a free or pay service is almost certainly out of the question. I'm pretty excited about this project as a practical one that's within my abilities once the search source is figured out, though. There are a few services out there that seem to be using google scholar results, so maybe it's easier than it looks: see Publish or Perish - I don't know how these guys are licensed - and Pubfeed - Maria reports insufficient results on this one, but it's a local project, so I'll ask around. This 'touchgraph' does it an interesting way: it's a bookmarklet, so they don't need to return google search results elsewhere. Not quite applicable, but it's getting me thinking about alternate ways of doing this.
Configuration Management for Large-Scale Scientific Computing at the UK Met Office
A description of developing and deploying a new configuration management system for the research group. I have a slightly better handle on their current processes and the information that'll be available to us. For instance, much of the old version history was imported when they moved to subversion a few years ago. The key takeaway for me was how much support and customization was required to get them to adopt a new system. Any tools we build will have to be extremely easy to use with obvious and immediate benefit if they're to be useful. Simplicity will be the byword.
Where’s the Real Bottleneck in Scientific Computing? and Software Carpentry
Quick reads on the basics computational scientists should be taught. Basically covers the material in CSC108 and CSC148 from a slightly different perspective.
Software Development Environments for Scientific and Engineering Software: A Series of Case Studies
Gives some insight as to how researchers come to conclusions about software engineering, but not really worth the read. Skip to section 5 for conclusions about how large scientific computing teams work.
Who should fix this bug?
An extremely interesting look at a project to cull information from bug reports and CVS repositories for Eclipse and Mozilla for automatic recommendations as to who should be assigned new bugs. It looks to me like what they worked on was way out of scope for the time and expertise our team has available, but it's from a few years back, so there may be further projects and tools available now that we could model our attempts at developing social network models from repository information on. Even if we don't use anything like this, it's an illuminating look at the complexities involved in developing and testing an aggregator from this sort of data.
Internet Groupware for Scientific Collaboration
An overview of group collaboration software as of 2000. I found this really useful as an introduction to the culture of the discourse; some of the comments made by Steve and Greg make more sense in the context of the goals and challenges of group collaboration online here. The much more recent post Now that’s what I call social networking… kind of helped tie it in to current technology trends for me.
The Django Book
I'm coming around. It feels like slower going than learning Rails because they focus heavily on making explicit things that just kind of happened in Rails. I really do appreciate that level of control, however, and I think I'm going to enjoy working in it.
I'm taking the modest head start I've got as license to spend some extra time reading up on the problem domain and thinking about the possible projects suggested by Steve Easterbrook and Greg Wilson.
Research alerts with a social component
People set up queries and receive alerts when relevant papers are released. Analysis of queries and/or results to suggest contacts with people researching similar topics.
- It would be nice to keep things loosely coupled so we can have a central place for queries with the ability for people to add new frontends for different places to use it. For example: I have a widget on my blog that informs me and others of what's been recently recommended. It suggests I talk to B who uses the same service through a facebook app and C who uses it through a dedicated website and D who uses a desktop app that automatically harvests papers searched for (like last.fm? - this one would be hard and way off spec, but fun for a future project idea)
- Would want to have 'roles' available - I might not want to restrict my results to people who have the same two sub-specialties as I do.
- Are we looking at piggybacking on existing search engines? That would make sense, but we'd need to ensure we're respecting fair use.
Electronic lab book
These researchers are using basic wikis to keep research notes. How can we make this more useful? Can we mostly replicate the function of paper lab books so that research processes can be more easily shared?
- I need to see some paper lab books or grill a scientist. Really, I have no idea what would be useful.
- The most basic thing would be a suite of wiki templates. They're probably already using something like this.
- Is it realistic to consider whether they might move to tablet computers + handwriting recognition office software soon, letting them simply use the screen as they've been using paper? I've read that the technology's supposed to get much cheaper soon, and that windows 7 is slated for inbuilt support, so maybe that'll just happen for them as it gets more broadly adopted. "Someone else will probably fix it in the future" isn't a very good plan, though.
Construct a graph of social interactions by mining old emails, forums, agendas, and team lists. Make another of code dependencies related to authorship information. Compare the graphs, with an eye to determining whether and which discrepancies are evidence of communication inefficiencies.
This would be an interesting project, but making it reasonably transferrable to analysis of information from other organizations sounds like a beast of a job. Once you've got social network graphs from other sources such as the research alerts project, however, they shouldn't be too tough to combine; the trick would be figuring out how significant the differences are and whether you're generating useful comparisons. The data can then be used for a variety of tools. As you can probably tell, I'm a little hazy on this whole process, but reading up. For a much more cogent explanation, see this post by Steve.
Ways to easily add visualizations of data to papers and websites
It looks like there's already a lot of quality work going on in standards for embedding the code that generates a visualization into the research paper itself. I'm not quite sure where we could help, but the idea's been floating around, so I'm leaving it here as a reminder to ask around.
What I'm reading
Engineering the Software for Understanding Climate Change
An overview of working environment of the researchers we'll be trying to help. Focuses on the differences between their processes and ones we're more used to in software development, and on challenges to productivity that could be solved by software engineering tools and practices.
The Django Book
Not excited yet. Must press on.
I'm an undergraduate student at the University of Toronto working on software development support tools for climate scientists, funded by a Natural Sciences and Engineering Research Council undergraduate summer research award (NSERC-USRA). I'm working with four other students under the supervision of Steve Easterbrook.
So far, the only constraint is that we work on developing tools that might be useful to the researchers at the Met Office Hadley Centre and similar departments around the world. I'm just starting to learn about how they work. These researchers develop complex software models of climate systems and run them as experiments, comparing results with other projections and real world observation. They work in Fortran on code that has components still in use decades after their original conception, and have recently started collaborating more heavily with other research groups abroad.
I'm starting this blog to help me organize my thoughts for my summer research position. I figure I'll discuss what I'm working on now and where I think we should be going in the future, plus any difficulties I'm having. I'll toss up what I've been reading with links and summaries to jog my memory, too. I hope it'll be useful to my teammates to be able to see what I'm working on - and to correct me when I'm off base.