Monday, October 24, 2022

How to become a data-centric rather than map-centric GIS user

We’ve all seen them. The colleague who has 30 or 40 layers in a map document. It usually takes 15 minutes each time they refresh the map. Each time their data gets moved, their world is destroyed because every data link is broken. Migrating from one software package (e.g. ArcMap, ArcGIS Pro, QGIS) to another is equally or even more traumatic. A corrupt map document might spell the end of their project/career.

Don’t be that person! By following a few simple rules and sticking with them, you can transform yourself from a map-centric GIS user into a data-centric GIS user, greatly enhancing your productivity and making life much easier for your colleagues. You can even take things a step further by preaching the gospel of GIS data-centrism to your colleagues or sharing this document with them. Here are the simple rules that I’ve developed over 20+ years of GIS work and by watching what others before me have done.

1.  Only use map documents for true cartography rather than as a way to locate (or re-locate) particular GIS layers. If you have lots of layers that aren't visible in your final map because they are turned off, this might be an indication that you have too many layers in your map. Consider also that having lots of layers means that the map takes longer to load each time you refresh. Keep the number of layers to a minimum and only turn on what you need when you need it. This is especially true for basemaps that get streamed in. Those can really slow the refresh times!

2.  Create a folder structure that makes sense to you, using descriptive names and logical groupings of layers (e.g. all of the hydrology and water-related themes might be grouped together in one folder and all of the administrative and boundary layers in another).

3.  In ArcGIS and QGIS, keep the number of connected folders to a minimum. Hopefully, by having a good file structure (see #2 above), you can quickly memorize where your folders are located. In the example from #2, there might be administrative and hydrological folders within the same project. It would be redundant to connect to both of them. A better option might be to connect to the folder at the root level.

4.  Differentiate between project data and organizational data. Organizational data is likely to be shared widely across your organization and used by many users across departments. These data sets are likely to be used indefinitely and have a strong need for periodic updates. In contrast, project data is likely to be used by one person or a small number of people for a limited time span. A third type of data is departmental data, which has many of the characteristics of organizational data but is used by a smaller number of users.

5.  Use temp folders for staging data. Data that is downloaded in its raw format can be stored in a folder. Partially processed data that is not yet ready to share should be stored in a temp folder within your greater project/department/organization folder, not in c:\temp or a default geodatabase. Reserve the latter type of temp location for true garbage. Intermediate geoprocessing steps can be saved to those truly temporary places if they are unlikely to ever be used again (e.g. you clipped a layer and then immediately dissolved it in a second processing step).

6.  Design and think about what your final data structure will look like in advance. Which datasets would you share with your funders/clients/collaborators?  Don't make them dig through garbage to find the data that they need.

7.  If working in a multi-user environment, consider adding dates, creator initials (e.g. td), and source information (e.g. usgs, pad-us) to the file name so that people can quickly identify who was last working with the data, where it came from, and how old it is.

8.  Use metadata and readmes abundantly. Super detailed metadata isn't needed 100% of the time, but learning to use metadata effectively can help you remember where your data came from! Identify the minimum metadata requirements, such as where the data came from and your contact info, and add that information to any dataset that you share with others. Use readme files to describe your file structure.

9.  There is nothing more annoying than having map documents with broken data links. In ArcMap, one partial remedy is to store your data using relative pathnames: File -> Map Document Properties -> check the "Store relative pathnames to data sources" box. Another simple solution outside of GIS is to create a text file or Word document that lists all of the pathnames. That way, if a drive letter changes, you'll always know where your files are located. This can be especially useful if you know that you will need to revisit a map document months or even years later.

10. Use layer files, colormaps, and layer styles to save symbology and then keep these files with your vector and raster data. Map documents will sometimes become corrupted. Saving symbology is a critical first step towards ensuring that your layers look the way you want to present them to others.  Name the layer files/layer styles in an unambiguous manner so that they match the original reference data.

11. Consider placing your GIS layers in the cloud as streaming content. For example, ArcGIS Online and DataBasin are two excellent places to make data available. This is one of the easiest ways to share data with others without requiring them to have lots of GIS knowledge.

12. Use definition queries to minimize the number of layer exports. For example, you could select all of the private parcels and export them to a separate shapefile and all of the public parcels and export them to a separate shapefile and end up with three shapefiles (public, private, and original). Alternatively, you could use definition queries and just keep a single shapefile.

13. Use a GIS rather than Windows Explorer for GIS data management. Beginner GIS users tend to get confused over why there are multiple files with the same name. The Microsoft Windows default of hiding extensions for known file types makes it doubly confusing when it comes to GIS data (in particular .tif). GIS software is designed to bundle all of these files together and make them appear as one file. Use the GIS to avoid confusion and to avoid potentially corrupting a dataset by moving (or renaming) one file and not the others.

14. Academics should consider using data repositories such as Dryad or ScienceBase to share their data with others. Not only will it make the data available to others, but it will force you to organize and document your files.

15. Document your processing steps as you go. It can be difficult to remember years later what you did to which file, and when. Keep your notes in a text file or Word document. Include web links to source data, along with specific processing steps and settings. Make sure to include the date that you did each step. ArcGIS has some tools that track processing steps automatically in the metadata and in a geoprocessing history toolbox. In general, good documentation will eliminate the need to use these tools and you’ll only need to resort to them when you've forgotten to include a step.

16. Avoid geodatabases in favor of GeoTIFFs and shapefiles. Geodatabases and other proprietary data formats have their place, but they don't play well with other software (R, especially). For simplicity's sake, stick to the more general file formats unless you have specific needs for a geodatabase (e.g. topology, relationship classes). If you want to share the geodatabase, consider also sharing the shapefile/GeoTIFF version of the data.
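
To make rule 7 a little more concrete, here is a minimal Python sketch of one possible naming convention. The pattern `theme_source_initials_YYYYMMDD` and the function name `dataset_name` are assumptions made up for illustration, not a standard; adapt the order and separators to whatever your team agrees on.

```python
import re
from datetime import date


def dataset_name(theme: str, source: str, initials: str, when: date) -> str:
    """Build a file base name such as 'parcels_usgs_td_20221024'.

    Hypothetical convention: lowercase parts joined by underscores,
    followed by the date in YYYYMMDD form so names sort chronologically.
    """
    cleaned = []
    for part in (theme, source, initials):
        # Lowercase and drop spaces and punctuation (hyphens are kept,
        # so a source such as 'pad-us' stays readable).
        text = re.sub(r"[^a-z0-9-]+", "", part.lower())
        if not text:
            raise ValueError(f"empty name part: {part!r}")
        cleaned.append(text)
    return "_".join(cleaned + [when.strftime("%Y%m%d")])


print(dataset_name("Protected Areas", "PAD-US", "td", date(2022, 10, 24)))
```

Putting the date last in a fixed-width form means that several versions of the same layer sit next to each other and sort from oldest to newest in a folder listing. Note that hyphens are fine in file names but are not allowed in some database table names, so strip them if you ever expect to load the data into a geodatabase.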
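
As a rough companion to rule 9, the list of pathnames can also be generated rather than typed by hand. The sketch below uses only the Python standard library and scans a project folder instead of reading a map document (reading the map document itself would require the GIS software's own scripting interface). The function names, the output file name `data_inventory.txt`, and the extension list are all assumptions for illustration.

```python
from pathlib import Path

# Assumption: only shapefiles and GeoTIFFs are of interest; extend as needed.
GIS_EXTENSIONS = {".shp", ".tif"}


def data_inventory(project_root: Path) -> list[str]:
    """Return the path of every dataset under project_root, relative to it."""
    return sorted(
        p.relative_to(project_root).as_posix()
        for p in project_root.rglob("*")
        if p.is_file() and p.suffix.lower() in GIS_EXTENSIONS
    )


def write_inventory(project_root: Path, out_name: str = "data_inventory.txt") -> Path:
    """Write the inventory to a text file at the top of the project folder."""
    out_path = project_root / out_name
    lines = [f"Project root when written: {project_root.resolve()}"]
    lines += data_inventory(project_root)
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

Because the paths are stored relative to the project root, the list stays valid if the whole folder moves to another drive letter; the first line records where the folder lived when the inventory was written.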
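
Rule 13 recommends letting the GIS move and rename data. If you ever do have to move a shapefile or GeoTIFF outside of a GIS, the key point is that every file sharing the base name has to travel together. Here is a minimal Python sketch of that idea; the helper names are hypothetical, and it assumes that base names contain no dots and that no two different datasets in the same folder share a base name.

```python
import shutil
from pathlib import Path


def base_name(path: Path) -> str:
    """Text before the first dot: 'roads.shp.xml' and 'roads.dbf' both give 'roads'."""
    return path.name.split(".", 1)[0]


def sidecar_files(dataset: Path) -> list[Path]:
    """List every file in the dataset's folder that shares its base name."""
    stem = base_name(dataset)
    return sorted(
        p for p in dataset.parent.iterdir()
        if p.is_file() and base_name(p) == stem
    )


def move_dataset(dataset: Path, dest_dir: Path) -> list[Path]:
    """Move a dataset and all of its sidecar files into dest_dir together."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    sources = sidecar_files(dataset)
    targets = [dest_dir / src.name for src in sources]
    # Check every target first so a name clash cannot leave a half-moved dataset.
    clashes = [t for t in targets if t.exists()]
    if clashes:
        raise FileExistsError(f"already present in destination: {clashes}")
    for src, target in zip(sources, targets):
        shutil.move(str(src), str(target))
    return targets
```

For example, `move_dataset(Path("raw/roads.shp"), Path("transportation"))` would carry along `roads.shx`, `roads.dbf`, `roads.prj`, and `roads.shp.xml` if they exist, while leaving unrelated files in `raw` untouched.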