Introducing “map gardening”

Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance.
– Kurt Vonnegut, Hocus Pocus p. 238

At the recent Association of American Geographers conference in Los Angeles, I presented some of my in-progress dissertation research on the dynamics of editing behavior in OpenStreetMap. In the next few blog posts I will summarize what I presented at the conference and what I’ve been doing since then. In this post in particular I’ll explain what I’m trying to accomplish, and describe the basic components of my analysis. I will save most of the results for other blog posts. I welcome comments and critiques, especially from members of the OpenStreetMap community.

If you want, you can skip ahead to the maps and charts.

UPDATE: I presented this work at the OpenStreetMap State of the Map US conference in June. You can watch a recording of the presentation here.

Motivation

One of the fundamental questions in my research involves the distinction between building and maintenance in crowdsourced online knowledge projects such as Wikipedia or OpenStreetMap, projects where both building and maintenance are largely performed by volunteers without financial gain. While I disagree with the sentiment in this Vonnegut quote—clearly there are people who do enjoy some kinds of maintenance—intuitively we might expect that building and maintenance involve different kinds of tasks, often requiring different types of skills or different personality types, and usually offering different kinds of (non-monetary) rewards.

Why is this important? While the volume of Volunteered Geographic Information (VGI) continues its rapid growth, many collaborative mapping projects are approaching a mature phase, in which the primary activity shifts away from the initial creation of data to its ongoing maintenance. For example, in many parts of the world, OpenStreetMap’s road network is approaching “completeness” (at least by some metrics, such as total length of line features), leaving OSM contributors to move on to mapping other features, such as adding buildings and other points of interest, or elaborating on existing features by introducing more attribute information (such as adding lane counts to motorways or modeling turn restrictions at intersections).

The tasks of adding attribute data and periodically updating the map to reflect changes on the ground may seem less glamorous than the initial process of filling in the blank spots on the map, but they are no less important. The accuracy and reliability of volunteered data depend on this continual process of maintenance, and supporters of VGI often tout user-based quality control as a key feature that makes VGI superior to proprietary, expert-created data sources.

Despite its importance, however, the nature of volunteer-led maintenance of geographic information remains poorly understood. Most research to date does not differentiate between various types of interaction with VGI datasets, treating all contributions (and all contributors) in the same way. This leaves many questions unanswered. For example: are the users who initially contribute data the same users who maintain it later? Do the types of tasks undertaken by users change over the duration of their engagement with a project? Are initial contributors more or less likely to have local knowledge of a place than those who maintain that data later?

In my research, I am developing a typology of different modes of interaction with VGI datasets, specifically to distinguish between the acts of adding new data and editing existing data. Drawing from similar non-geospatial peer production projects such as Wikipedia, I adapt the concept of “wiki gardening” and propose the term “map gardening” to describe volunteer-based maintenance of VGI. In this blog post I describe my initial quantitative analysis of the OSM database to determine the extent and frequency of these different types of interaction.

Defining building and maintenance (or “exploring” and “gardening”)

Defining the difference between building and maintenance is difficult and subjective. Crowdsourced projects such as OpenStreetMap are never “finished” and remain in a constant state of revision and flux. One person might consider their contributions to OSM to be part of building the project, while someone else might perceive the same activity as a form of maintenance. Elsewhere in my dissertation I dive into this question further, including the social aspects of motivation and reward that surround the distinctions between building and maintenance.

For the purposes of the analysis I present in this post, however, I take a simple approach, dividing OSM edits into two categories:

  1. Contributions to areas that are “blank spots” on the map at the time the edit occurred. I will often call these “blankspot edits” for brevity. These edits represent mappers who are filling in the map for the first time; in a sense, they are “exploring” the unknown.
  2. Any edits in areas where some features already exist. This category includes several different kinds of edits: introducing new features next to existing features, modifying the shape of existing features, or adding or modifying the attribute data (known in OSM as “tags”) attached to existing features. I am calling all of these activities “map gardening”.

Clearly, different assumptions made at this point would result in very different quantitative findings from the analysis. In later phases of my research I plan to refine this typology to further subdivide the types of editing activities I am currently grouping together under the umbrella concept of gardening. More accurately, we might think of “exploring” and “gardening” as two ends of a continuum.

The question of imports

This classification of individual editor behavior will help us understand individuals’ roles within the overall evolution of OSM, and will also help us make comparisons between different areas around the world. In particular, this analysis may help cast light on an ongoing debate within the OpenStreetMap community about the role of data imports. In some parts of the world, most notably in the United States, the bulk of the data within OSM was not contributed by volunteer mappers but rather was imported en masse from external data sources.


Number of features in the OpenStreetMap database. The spike in 2007 caused by the TIGER data import is clearly visible. Source: http://wiki.openstreetmap.org/wiki/Stats

In the case of the US, the federal government’s TIGER data is in the public domain, which is compatible with OpenStreetMap’s Creative Commons-based license (now the similar ODbL). The TIGER data provided OSM with a complete street network for the United States, which could then be refined and built upon by OSM volunteers. On the one hand, data imports appear to be a great advantage to OSM, giving the project a jump start and getting the map closer to providing a free and usable world map without having to create every feature “from scratch”. However, many people within the OSM community argue that imports hinder the development of strong local mapping communities by giving the impression that the map is already finished. By this argument, there would be fewer active mappers to detect errors in the TIGER data or keep it up to date as facts on the ground change over time.

In my research I hope to add to this discussion about imports by investigating how editing activity differs between areas that experienced major data imports and those that did not. By looking at the editing history of contributors within each of these areas, we may be able to determine whether the presence of imports actually hinders the growth of local mapping communities.

Data and analysis

I downloaded the OSM “full history dump” (http://wiki.openstreetmap.org/wiki/Planet.osm/full), which includes every previous version of every feature in the OSM database. One important caveat: the OSM database structure changed significantly with the introduction of API version 0.5 in October 2007. For features that existed before October 2007, the history dump includes only the version that existed at the moment of the changeover. For example, if a node was created in 2005 and then modified in 2006, only the 2006 version of that node exists in the database. What that node looked like in 2005 (and who edited it) is now lost.

I used a history dump that includes all versions up to June 1, 2012. I chose this cut-off date to avoid analyzing any of the redactions that occurred in the OSM database starting in July 2012 in preparation for the transition to the ODbL license.

For the purposes of this analysis I disregarded ways and relations and only looked at the nodes in the database (which include not just POIs but also all the nodes that define the geometry of ways). I used Peter Körner’s powerful osm-history-splitter and osm-history-renderer tools to extract a small number of study areas and load them into a PostGIS database.
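As an aside for readers who want to poke at the history data themselves: the snippet below is a minimal sketch (not my actual pipeline, which relies on Körner’s tools) of walking a history extract with the pyosmium library and tallying node versions. The file name is a placeholder.

```python
# A minimal sketch, not the pipeline described above: walk an OSM
# history extract with pyosmium and tally node versions.
# "region.osh.pbf" is a placeholder file name.
import osmium

class NodeCounter(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.node_versions = 0   # every version of every node
        self.version1_nodes = 0  # nodes at the moment of their creation

    def node(self, n):
        # For a history file, this callback fires once per node version.
        self.node_versions += 1
        if n.version == 1:
            self.version1_nodes += 1

counter = NodeCounter()
counter.apply_file("region.osh.pbf")
print(counter.node_versions, counter.version1_nodes)
```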

In each study area I overlaid a grid aligned with the appropriate UTM zone. I began with 1000m grid spacing (although I am also testing smaller grid sizes), then queried PostGIS to find the earliest node within each grid cell. These nodes I call the “blank spot” edits, as they were mapped in a grid cell that was empty at the moment they were added to the database. All other nodes I treat as “gardening” edits.
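To make this concrete, here is a rough sketch of what that query might look like. The table and column names (analysis_grid, nodes, tstamp, and so on) are hypothetical stand-ins, not my actual schema.

```python
# Rough sketch of the "blank spot" query; table and column names are
# hypothetical. DISTINCT ON keeps only the first row per grid cell,
# which (given the ORDER BY) is the earliest node in that cell.
import psycopg2

BLANKSPOT_SQL = """
SELECT DISTINCT ON (cell.id)
       n.id AS node_id, n.tstamp, n.user_name, cell.id AS cell_id
FROM   analysis_grid AS cell
JOIN   nodes AS n ON ST_Contains(cell.geom, n.geom)
ORDER  BY cell.id, n.tstamp;
"""

conn = psycopg2.connect(dbname="osm_history")  # placeholder connection
with conn, conn.cursor() as cur:
    cur.execute(BLANKSPOT_SQL)
    blankspots = cur.fetchall()  # one "blank spot" edit per non-empty cell
```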

Having tagged each node as a blank spot edit or not, I then calculated statistics for each unique user who had at least one edit within the given study area. Any edits that each user made outside the study area were not counted. I calculated the following statistics for each user:

  • Number of edited nodes – The number of times the user’s name appears in the database. This count is not the same as OSM’s concept of an “edit”: because we are treating each node individually, if a contributor creates a polygon that comprises 10 nodes, that would create 10 edited node records in the database.
  • Number of version 1 nodes – Similar to the count of edited nodes, but including only nodes that are at version 1 in the database. Version 1 marks the moment a feature is created in the database; any subsequent modification (changing its location or adding attribute data) increments the version number. Version 1 edits resemble our concept of “exploring”, in contrast to subsequent edits, which fall under our concept of “map gardening”. However, using version 1 edits as an indicator of “exploring” is problematic in at least two ways. First, newly added version 1 nodes could represent features that are immediately adjacent to existing features, such as buildings, bus stops, or fire hydrants. They might also include minor roads in between existing features, a process that Corcoran et al. call “densification”. Second, the OSM database provides no guarantee that a real-world feature is always represented by the same database object (a specific node, in this case). OSM editors who are updating a feature frequently delete the database object entirely and recreate it anew; in that case, an activity we would clearly call “gardening” produces a new version 1 feature in the database.
  • Number of edited nodes in “blank spots on the map” – To avoid the problems described with counting version 1 edits, we can identify version 1 edits that occur in grid cells where no other nodes exist. These are the “blank spot edits” described earlier.
  • First editing date – The date of the contributor’s first edit, in other words, the first date the user appears in the database. As the map fills in over time, there will be fewer opportunities to make “blank spot” edits. Thus, OSM contributors who joined the project early on are likely to have made more blank spot edits.
  • Total number of days editing – A count of the number of unique days (measured from midnight to midnight in the time zone for each study area) that the contributor edited nodes in the database.
  • Mean editing date – The average of all the dates in which the contributor was active in the database.
  • Weighted mean editing date – The average of all the dates in which the contributor was active in the database, weighted by the number of edits that occurred on each date.

This is by no means an exhaustive list of statistics that could be generated using this dataset.
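For concreteness, here is a rough sketch of how these per-user statistics might be computed with pandas once the tagged node edits are exported from PostGIS. The column names are assumptions for illustration, not my actual schema.

```python
# Sketch of the per-user statistics. Assumes a DataFrame of node edits
# with hypothetical columns: user, tstamp (datetime, local time),
# version (int), and is_blankspot (bool).
import pandas as pd

def user_stats(edits: pd.DataFrame) -> pd.DataFrame:
    edits = edits.assign(day=edits["tstamp"].dt.floor("D"))
    g = edits.groupby("user")
    return pd.DataFrame({
        "edited_nodes": g.size(),
        "version1_nodes": g["version"].apply(lambda v: (v == 1).sum()),
        "blankspot_edits": g["is_blankspot"].sum(),
        "first_edit": g["tstamp"].min(),
        "days_editing": g["day"].nunique(),
        # unweighted mean over the distinct days the user was active...
        "mean_edit_date": g["day"].apply(lambda d: d.drop_duplicates().mean()),
        # ...versus the mean weighted by the number of edits on each day
        "weighted_mean_date": g["day"].apply(lambda d: d.mean()),
    })
```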

Study areas

I selected a few study areas, mostly metropolitan areas in Europe and North America, with the specific goal of looking for differences between areas that experienced data imports and those that did not. I plan to keep adding new study areas as I move forward.

To illustrate how the OpenStreetMap coverage in these study areas followed very different evolutionary paths, here are some animations of London, Seattle, and Vancouver, showing how each map changed over time.

London:

London is a prime example of an area mapped “from scratch” by volunteers, with no significant imports. The map fills in steadily and quickly, with most of the blank spots filled in by 2008. Continued refinement is visible throughout the map from 2008 onward.

Seattle:

Seattle is a typical example of a US city that shows the result of a major data import. The map begins with very little data, then fills in suddenly due to the large TIGER import in 2007. However, even after the TIGER import, continuous gardening activity is apparent, much like in the London map.

Vancouver:

Vancouver shows slow but continuous growth, marked by several imports that are much smaller than Seattle’s TIGER import and occur much later, beginning around 2009. In contrast to the US, data imports in the Canadian context were delayed by concerns that the licensing terms on Canadian government map data were incompatible with OpenStreetMap’s license. By the time the legal uncertainty was resolved in late 2008, OSM contributors had already mapped significant portions of Canada’s major cities.

Mapping the “blank spot” edits

I used latitude/longitude bounding boxes to extract the study areas from the full OSM history database, and then created analysis grids aligned to the appropriate UTM zone for each study area. The following maps show each study area, along with the analysis grid and the nodes identified as “blank spot” edits. Move your mouse over each node to see the date of the edit and the contributor’s username. Nodes lacking a username were contributed by anonymous editors.
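As an illustration of the grid construction, here is a small sketch using pyproj. The EPSG code, bounding box, and cell size are example values; UTM zone 30N (EPSG:32630) is roughly appropriate for London.

```python
# Sketch: build a 1000 m analysis grid aligned to a UTM zone.
# EPSG:32630 (UTM zone 30N, roughly covering London) is an example.
from pyproj import Transformer

def utm_grid(bbox_lonlat, epsg_utm, cell=1000.0):
    """Return (minx, miny, maxx, maxy) cell bounds covering the bbox."""
    to_utm = Transformer.from_crs("EPSG:4326", f"EPSG:{epsg_utm}",
                                  always_xy=True)
    minx, miny = to_utm.transform(bbox_lonlat[0], bbox_lonlat[1])
    maxx, maxy = to_utm.transform(bbox_lonlat[2], bbox_lonlat[3])
    # Snap the grid origin to a whole multiple of the cell size so that
    # grids built from different bounding boxes line up with each other.
    x0, y0 = cell * (minx // cell), cell * (miny // cell)
    cells = []
    y = y0
    while y < maxy:
        x = x0
        while x < maxx:
            cells.append((x, y, x + cell, y + cell))
            x += cell
        y += cell
    return cells

# Example: a rough bounding box around London
grid = utm_grid((-0.5, 51.3, 0.3, 51.7), 32630)
```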

The basemap shows a rendering of every version of every line and polygon edge present in the OSM database. Thus, thicker and darker lines are not necessarily more important features, but are simply those features that have been edited more frequently (having more versions present in the database). Lines that were subsequently deleted from OSM are still present on these maps.

London:
[interactive map]

Seattle:
[interactive map]

Vancouver:
[interactive map]


Scatterplots

Finally, here are some scatterplots showing the differences in editing styles among contributors in each study area. Each dot represents a unique username in the database. The dots are sized according to the number of days each user was active editing the map. The dots are positioned according to how many total nodes the user edited (the x axis, scaled logarithmically) and how many blankspot edits the user made (the y axis, also a log scale).

Note that the points are bounded by the diagonal line y = x (not drawn), as expected: the number of blankspot edits can never exceed the total number of edits. The closer a user sits to that diagonal, the higher their percentage of blankspot edits. In other words, users near the diagonal tend to be “map explorers”, while users along the x axis at the bottom of the chart tend to be “map gardeners”. Move your mouse over each dot to see the username and number of edits.
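If you’d like to reproduce this kind of chart, here is a minimal matplotlib sketch building on the hypothetical per-user stats DataFrame above. The +1 offset is simply my way of keeping users with zero blankspot edits visible on a log axis.

```python
# Minimal sketch of an explorers-vs-gardeners scatterplot, using the
# hypothetical per-user stats DataFrame sketched earlier. The +1 offset
# keeps users with zero blankspot edits plottable on a log-scaled axis.
import matplotlib.pyplot as plt

def plot_explorers_vs_gardeners(stats, title):
    fig, ax = plt.subplots()
    ax.scatter(stats["edited_nodes"],
               stats["blankspot_edits"] + 1,
               s=stats["days_editing"],   # dot size ~ days active
               alpha=0.4)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("total edited nodes")
    ax.set_ylabel("blank spot edits (+1)")
    ax.set_title(title)
    return fig
```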


More to come in later posts…


2 Comments

  1. Posted May 27, 2013 at 6:02 am

    So are there basically no manual map creators? It’s difficult to get a sense from the scatterplots, but it seems like most of the individuals are down along the “gardener” x-axis, and many of those that aren’t have on the order of 10,000 blank spot edits, indicating that their contributions came from imports of existing databases.

  2. Posted May 27, 2013 at 11:21 am

    W.P., you can see the exact number of blank spot edits by mousing over the points on the scatterplot. So the account with the most blank spots is DaveHansenTiger, with 4475 blank spots in the Seattle data. As you might guess from “Tiger” in the name, that is the account that was used for the main TIGER import. Similarly, Vancouver has mbiker_imports_and_more, which is one of the main importer accounts there. There are still a lot of human mappers who have relatively high numbers of blank spot edits, but of course nobody can compete with the imports in Seattle and Vancouver. It’s interesting to compare with the London data, where those contributors in the upper right corner are definitely manual map creators.

    But be careful comparing the raw values on the x axis with those on the y axis. My standard for identifying “map exploring” in this first pass through the data is very strict: being the first one to edit in each 1 km square grid cell is a very high bar to attain. Only the very early mappers are likely to be considered explorers by this metric. I have also been experimenting with smaller grid cells (500m and 250m) which results in a much larger number of edits categorized as blank spot edits. But the overall pattern is still the same: lots of map gardeners. That’s what I’m working on next, teasing apart that big mass of users that I’m currently lumping together as gardeners.
