Guide to managing and organizing data
by Peter Wagner
June 10, 2021, last updated: March 01, 2022
We originally prepared this guide to help new Prison Policy Initiative staffers learn how we think about managing and organizing data. Unlike our Writing Guide, this guide doesn’t lay out pieces of our political vision. It does, however, offer tips to make errors less likely and profound conclusions more common, both while you are working on the project and months or years in the future when you or your colleagues return to that work.
It should go without saying there is not a “best” way to organize data or a project. This is just how the Prison Policy Initiative does it. Organizations with different strategies and different priorities will likely want to do things differently.
Most of this guide speaks to organizing data in Excel because that’s what we use most often. But it’s relevant to all kinds of organizing of digital and paper records. Actually formatting data for final publication is something that will be addressed in a later version of this memo or elsewhere.
The single most helpful thing to keep in mind is that you are organizing data for more purposes than just your own and that other people will be seeing it in various formats and at different times in the future. Here are four audiences to keep in mind:
- Yourself in six weeks. You are going to forget how you did stuff, so document things for Future You.
- A future coworker. Projects can get passed between people, need to be updated, or even overlap with other projects. You want your coworkers to be able to pick up your data and understand how it is organized, managed, and collected without a hassle.
- Your supervisor. Your supervisor has more experience than you and may have more subject matter knowledge than you, but will have far less experience with the data you are organizing than you do. They may need to look at your data after hours, so you need to make it clear and easy to understand to head off questions.
- The decision-makers you are hoping to influence. Ultimately, some or all of this data will be published with the goal of changing decisions that other people make. Because we wouldn’t be collecting data if the conclusion was already obvious, you can safely assume that the decision-makers in question are going to be skeptical of your work, your sources, and your assumptions.
Basic principles of data management:
- Label and describe everything
- Put a headline on your data.
- Put a caption on your data. This should include when data was collected or last updated.
- Label where data comes from. Often a row called “sources” will do the trick, so you can explain that one column comes from a certain part of the Census, and that another column is from the Bureau of Justice Statistics, and the next column was calculated by you by dividing the Bureau of Justice Statistics figure by the Census figure.
- Don’t hesitate to create multiple columns for notes and quirks to keep them organized and associated with the underlying data. For example, if the data for one row comes from a different source than other cells in that column, explain that fact and why.
- If your data contains some manipulations — including column deletions — from a larger source, be sure to keep your original data — i.e. from NCPSR or BJS — somewhere. Give it a clear filename that says what it is, and include the Google Sheets URL or folder path name in a note within your spreadsheet.
Consider including your formulas in plain English somewhere in your spreadsheet. Not only does it help with you double-checking your math, but it is also helpful for other people who need to understand your work.
- Distinguish between numbers that are the result of your own calculations and those that are derived directly from a source. Sometimes, to get the data you’re looking for, you need to combine information from multiple data sources. This can quickly get pretty confusing. To keep it clear, create a “Notes” column in your table, and anywhere that your methodology needs explaining or may be changed later, describe it as, “Calculated based on ….”. This will help you retrace your steps later and will simplify the process when writing up your methodology or explaining it to others.
If something is useless, wrong, or out of date, consider deleting it, as having multiple copies can invite errors. If you have a reason not to delete it, be sure to label it conspicuously (something along the lines of “Do Not Use” or “Not updated since xx/xx/xxxx”) and explain why you are keeping it so that Future You and everyone else will know.
Write clearly and consistently.
Be consistent. If you use different words to describe the same thing, you will invite questions about why they differ.
- Be cautious with abbreviations because abbreviations can change the sort order in unexpected ways or make joins/lookups fail. If you do use an abbreviation, consider adding its meaning in a key.
If any of your data sources have numeric identifiers, try to preserve them. (For example, FIPS or ANSI codes for geographies.) This can save you many headaches with data verification, publication, or data reuse later. It’s better to preserve this stuff now and decide not to publish that column than to have to recreate it later.
Specific suggestions for Excel:
- Use words, not colors, to signify meaning. Colors require a key to understand and are difficult or impossible to sort and filter based on. To the degree that you want to use colors for some internal tracking purpose, use “conditional formatting” to turn your words into a color.
- If your document has more than one sheet, make sure that your sheets/tabs are labeled, and that the first sheet/tab is called “Table of Contents,” “Instructions,” or similar and contains information on where content is located in the document. When publishing or emailing this file, be sure to save it with the table of contents sheet already selected. For an excellent example of a table of contents on a published spreadsheet, see our work collecting data on incarcerated populations by race/ethnicity and gender for each state.
- Do not use blank columns in your Excel tables, as this makes it easy for the two parts to be sorted separately and for your data to become gibberish. If you need some space between sections, adjust the column width and formatting as appropriate, or create a blank column and give it the header “spacer.” Note: It’s probably easier to just adjust the formatting within Excel without creating fake columns, but if you are going to convert the table to HTML, it is generally easier to have a fake spacer row. It is particularly difficult to insert blank columns once the HTML table has been prepared, so to the degree that your table requires fancy formatting that may or may not translate easily to HTML, you may want to bring this to the attention of whoever will be translating your table to HTML before they start that process.
- If you want to work with a subset of the data — for example for graphing purposes — one good technique is to copy the main spreadsheet into a new tab in the same file before further manipulating it (i.e., deleting unnecessary rows/columns, making new calculations, etc.). Then label the new tab clearly, this will help you keep track of the original dataset and the manipulated dataset, and ensure you don’t accidentally delete or change any data you may need later.
- Have a procedure for accurately merging tables together. Either use a lookup or have a process you can use to check that things matched after you paste items together. (For example, if you wanted to paste a table with state prison data into a table with state population data, don’t just paste in the prison data — paste in the state labels from that table as well. Then use a formula like =IF(B1=C1,”same value”,”different value”) to compare the state labels from each source table as a way check that the rows match by state and that there were no extra, missing, or differently ordered states throwing off your data. (See the example below to see this strategy in action.) Once you have confirmed the match, delete the redundant state label column and the now-unnecessary formula.
Example of using an Excel formula to automatically check that two tables copy and pasted together have compatible rows. The output “different value” is a sign that something has gone wrong in the attempt to merge the two tables together. (A real example would have lots of data pasted alongside the labels in columns A and B.)
| ||Column A||Column B||Column C
|Example row 1 ||Prison system ||State || ||
|Example row 2 ||Alabama ||Alabama ||=IF(A2=B2, “same value”,”different value”) ||same value
|Example row 3 ||Alaska ||Arkansas ||=IF(A3=B3, “same value”,”different value”) ||different value
- Make it hard to misread your data by setting the headings to bottom align and all other data to top align.
- Giving cells, rows, and columns name can make formulas easier to debug. For tips on this, see the 54-minute Joel Spolsky video about Excel. One warning on this video: The tone can be grating to people who aren’t in on the not-funny joke. Specifically, the presenter occasionally calls his employees in the video stupid for not knowing certain obscure things about Excel. To outsiders, this is pretty off-putting. What the presenter’s employees and most of the tech world know — but isn’t in the video — is (1) their boss has written countless books and articles about his secret sauce to find, hire and retain the best employees on the planet and (2) the real reason he knows this hidden stuff about Excel is because he worked on the Excel team back in the 1980s. Everybody in the room knows that he knows few people have this knowledge about Excel. On the internet, the training video plays differently, so if you have an alternative video, we’d be happy to hear of it.
- For a mail merge document, be sure to use consistent formatting and grammar in your data. This makes it easier to design the document without variations like extra spaces, capitalization, commas, and periods harming your output.
It’s outside the scope of the current training page, but a future version of this page will include some tips on formatting tables including alignment, bold, backgrounds, borders, column spans, decimal points/values, currency v. accounting format, etc.