Herbarium Data Analyst/Coordinator

California Academy of Sciences

About the Opportunity
As part of the biodiversity science efforts embedded within the Thriving California Initiative, the California Herbarium Specimen Digitization Project will make hundreds of thousands of specimens from the California Academy of Sciences (CAS) herbarium collections available online. The project uses high-throughput conveyor belt imaging to efficiently photograph California herbarium specimens housed at CAS. Specimen images and associated data records will then be uploaded to a community science platform for further transcription and georeferencing of label data. Afterwards, fully transcribed and georeferenced records will be imported into the CAS collections database and linked to their corresponding images. Results from this project will mark a major step forward in democratizing CAS museum collections, providing equitable access to these important specimens for people (botanists, scientists, and the general public) all over the world.


About the Botany Team
We are a team of botanists, scientists, professionals, and enthusiasts who collectively curate the Academy’s collection of over 2.3 million herbarium specimens. This position will broadly support adding collections imagery and label data to the CAS botany database. This will include working with scanning contractors, ingesting and cleaning label data, working with community science organizers to crowdsource data entry, OCR, georeferencing, and all related processes and technologies. The role is responsible for making data entry as efficient and accurate as possible by creating processes, working with colleagues, and using scripting/programming to automate those processes as needed.


Key Responsibilities

  • Work with the Botany Curator, Collection Manager, and Director of Scientific Computing to identify requirements for data import and export between contracted imaging and transcription services and the CAS internal database and computational infrastructure. This includes:
    • Scripted export to and ingest from community science data organizers
    • Scripted imagery and transcription ingest
  • Coordinate with contractors to implement workflows and pipelines that are in line with the needs of internal CAS databases and computational infrastructure
  • Develop, test and modify workflows and pipelines to georeference specimens using transcribed label data
  • Develop, regularly test, and modify (as needed) workflows and pipelines to achieve high-level quality control, modification, and/or data reshaping as images and associated records move from one place to another
  • Coordinate QC and data modification efforts with other digitization technicians
  • Coordinate with contractors to alter data delivery techniques and/or formats as needed
  • Coordinate with collection preparators to maximize data collection efficiency


A qualified person for this position is capable of working with large datasets without seeing each piece of data individually. This person is capable of working with data in multiple formats and can modify data to suit different software and application needs. This person has either a background in the natural sciences with extensive database and programming experience, or has a background in bioinformatics and/or computer/data science with coursework and interest in the natural sciences.

Experience and/or Education:

  • Undergraduate degree required, Master's degree (or higher) preferred
  • Experience with building, managing, and/or maintaining SQL databases
  • Experience working with large data, including cleaning/validation/transformation, clustering, and formatting.
  • Working knowledge of Python and preferably at least one other high-level language suitable for data analysis (e.g., R) and techniques (regular expressions, parsing, reading in formatted data, etc.)
  • Comfortable (ideally expert) with the Linux command line and bash scripting (bash, ssh, scp, rsync, awk, etc.)
  • Comfortable with task automation using scripting and programming tools
  • Knowledge of data cleaning tools (OpenRefine, Trifacta, etc.) and techniques
  • Working knowledge of common data formats (JSON, YAML, CSV, TSV) and issues therein (Unicode, whitespace, etc.)
  • Knowledge of biological data systems (GBIF, Encyclopedia of Life, NCBI, iNaturalist, etc.) and familiarity with geospatial data.
  • Knowledge of taxonomy and classification, ideally botanical.
  • Experience working as part of a team, with both independent and collaborative goals


Hourly hiring range: $36.06-$38.46 per hour. Hourly rate will vary based on experience and relevant skills/knowledge set. The Academy offers a total compensation package that emphasizes both base salary and comprehensive benefits.
Schedule: This is a full-time position, 40 hours per week, and is primarily on-site with occasional remote work possible. This is a temporary position with a duration of 24 months.


THIS IS A SHORT TURNAROUND! This position will close on April 10th, 2023 at 9am. Review of applications will begin on April 10th, 2023.

To apply for this job please visit californiaacademyofsciences.applytojob.com.
