Tuesday, November 24, 2015

U-M IT Symposium

Today, Max is presenting on our ArchivesSpace-Archivematica-DSpace Workflow Integration project at the University of Michigan IT Symposium, which is designed to "help create connections between community members, while showcasing the innovation occurring across all campuses."
Posters for the people!
We're trying to get the word out here on campus about our efforts to preserve and provide access to university records of long-term historical and administrative value.  We're hopeful that this kind of outreach will highlight our status as a trusted partner in the preservation of digital materials and drive more units to transfer their born-digital records to the archives.

Download a copy of the poster here.

Until next time...

Monday, November 23, 2015

Code4Lib 2016 Proposal

We've thrown our hat into the ring to present at Code4Lib 2016--please cast a vote for us if you'd like to hear more about the project and see a demo in Philadelphia next March!

Monday, November 16, 2015

Digital Objects and ArchivesSpace

One (somewhat unexpected) challenge in our ArchivesSpace-Archivematica-DSpace Workflow Integration project has involved mapping terms and terminologies across the different platforms.  In conversations with our development partners at Artefactual Systems, members of the ASpace community, and other peer institutions, we've found that it's really important to take a moment and make sure we're all on the same page when we're talking about something like a 'digital object.'

Having some common ground/shared understanding is very important, as our workflow establishes the following equivalences:

1 Archivematica SIP = 1 Archivematica AIP = 1 ASpace Digital Object Record = 1 DSpace Item

I'd like to take this opportunity to review the reasons behind this structure, but first I think it would be useful to take a look at how others in the ASpace user community are approaching digital object records.

Perspectives on the ASpace Digital Object Record

As evidenced by the introductory materials for an ASpace workshop that was held here in Ann Arbor this past January, the digital object record was designed to be flexible:
The Digital Object record is optimized for recording metadata for digitized facsimiles or born-digital resources. The Digital Object record can either be single- or multilevel, that is, it can have sub-components just like a Resource record. Moreover, the record can represent the structural relationship between the metadata and associated digital files--whether as simple relationships (e.g., a metadata record associated with a scanned image, and its derivatives) or complex relationships (e.g., a metadata record for a multi-paged item; and additionally, a metadata record for each scanned page, and its derivatives). One or more file versions can be referenced from the Digital Object metadata record.  The Digital Object record can be created from within a Resource record, or created independently and then either linked or not to a Resource record.
While this flexibility is great, it also provokes a lot of questions about just how to implement the digital object records, some of which have been featured in conversations on the ArchivesSpace Google Group as well as the ArchivesSpace Users Group.

On one end of the spectrum, we have complex digital objects--multilevel intellectual entities composed of multiple bitstreams that can be represented in a structured hierarchy.  Brad Westbrook provides some examples of this use case in this thread from the ASpace Google Group. Those of us in attendance at the "Using Open-Source Tools to Fulfill Digital Preservation Requirements" workshop a couple of weeks ago at iPRES got to see a real-world example of how a complex digital object could be represented in ASpace via content from the UC San Diego Research Data Curation Program in the Online Archive of California.

Far more common (based upon conversations with peers and posts to the lists) is a simpler approach in which the digital object record is used primarily to record URL information that will provide links to content from the public ASpace interface or from <dao> elements in exported EAD.  This thread provides some valuable thoughts from Ben Goldman, Jarrett Drake, Chris Prom, Maureen Callahan, and our own Max Eckard.

Several of the important ideas raised in that conversation include the need for institutions to:
  • Define systems of record for data/metadata and determine how ASpace fits into this ecosystem.
  • Identify how information in the digital object records can be used now and in the future (e.g., the records can bring together digital content stored in various systems/locations, serialize information to EAD files, or respond to queries via the API).
I won't attempt to delineate the different positions in the thread but encourage you to give it a thorough read!

Moving from this (very) brief review of the landscape, I wanted to identify some of our key assumptions here at the Bentley:
  • The general position outlined by Max is still accurate ("We're thinking of the DO module more as a place to record location than as a place to 'manage' digital objects or the events that happen to them"): we are primarily interested in using the ASpace digital object module to create <dao> tags and links to content in EAD finding aids.
    •  We would therefore not be looking to include technical/preservation metadata about AIPs in the digital object record or do extensive arrangement with the digital object components.
    • With the above in mind, the ‘digital object’ records become somewhat analogous to physical ‘instances’--these are manifestations of the archival description expressed in the associated archival object record.
    • In addition, within ASpace a digital object may be ‘simple’ or ‘complex’ (in the latter case, composed of one or more digital object components).  We're now contemplating slightly more 'complex' digital object records...
    • We've also been working with Artefactual Systems and some other peer institutions to think more about how and where to record machine-understandable/actionable PREMIS rights information associated with digital objects.
  • Within the new Appraisal and Arrangement tab, a dedicated ASpace pane will display the ‘archival objects’ (i.e., the subordinate components) of a given resource record in a hierarchical structure. Within the ASpace pane, users will be able to create new archival objects and add basic metadata.

  • Within the appraisal tab, archivists will drag/drop content (individual files and/or entire directories) to a given ‘archival object’ in the ASpace pane.

    • All content associated with an archival object will be a single SIP/AIP in Archivematica.
    • Furthermore, each SIP/AIP will comprise a single ASpace ‘digital object’
    • 1 ASpace digital object = 1 Archivematica SIP = 1 Archivematica AIP = 1 DSpace item
  • We are not spinning off separate DIPs; we may configure Archivematica's Format Policy Registry (FPR) to spin off lightweight copies for some file formats, but otherwise the Archival Information Packages (AIPs) will serve for both preservation and access.
  • The Bentley's past/current use of DSpace is another factor here, as a single 'item' may contain one or more 'bitstreams' (i.e., files).  We therefore would like to be able to do some minimal arrangement of bitstreams within an ASpace digital object to control how materials will be deposited to DSpace.
    • Whenever possible, we strive to describe materials at an aggregate level, which means that a fairly large number of files (in number or space on disk) may be associated with a given 'item.' We also package content in .zip files to reduce the number of files we have to manage and that our users have to download.
    • To avoid presenting our users with extremely large .zip files that could be difficult to download and access, we often will chunk content across multiple .zips--i.e., instead of one 10 GB .zip, we will provide users with five 2 GB zips, as evidenced in this example from our Governor Jennifer Granholm collection:

    • In other cases, we might want to differentiate between access and preservation copies of materials in a collection. As an example, the following DSpace item includes an .mp4 access copy of a video recording while the .zip file contains an .iso image file of the original DVD:


    • We see the DSpace item as being the equivalent of the ASpace digital object record, with the individual bitstreams corresponding to the digital object components.
    • We won't be using DSpace forever (Michigan recently became a Hydra partner) and so we don't want to predicate our ASpace-Archivematica workflows on legacy systems.
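
The chunking idea mentioned above (several smaller .zip files rather than one very large one) could be sketched roughly as follows. This is only an illustration, not our actual tooling: the function name, the (name, size) input format, and the size limit are all assumptions, and sizes here are uncompressed bytes, so the resulting .zip files would typically be somewhat smaller.

```python
from typing import List, Tuple


def chunk_files(files: List[Tuple[str, int]], max_bytes: int) -> List[List[str]]:
    """Group files, in order, into chunks whose total size stays under max_bytes.

    Hypothetical helper: `files` is a list of (name, size_in_bytes) pairs.
    A single file larger than max_bytes still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new chunk if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be written to its own .zip and deposited as a separate bitstream in the DSpace item. Keeping the files in their original order means related content tends to stay together, at the cost of not packing the chunks as tightly as possible.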

Potential New Features

So...where does this leave us?  I wanted to talk through a possible arrangement workflow (based upon the new Appraisal tab) and how this might be translated into ASpace digital object records.  Let's see how this goes...

We've suggested the addition of an “Add digital object component” button in the ASpace pane (see above screenshot), which could function as follows:
  • A user would select a particular archival object in the ASpace pane and click the “Add digital object component” button.
  • Clicking the button will trigger the creation of a ‘digital object component’ that will appear as a child of the archival object.
    • Adding at least one digital object component essentially creates the main digital object record (which may include multiple components).
    • All the ‘digital object components’ nested under an archival object will comprise a single ASpace ‘digital object.’
    • In arranging the digital object components, users would only be able to work with 1 level of hierarchy--this will be very simple and minimal ‘arrangement.’
  • A digital object component will essentially be a bucket or a virtual container where one or more files and/or folders may be dragged/dropped.
  • To visually distinguish the ‘digital object component’ from archival objects, it should have a different icon (perhaps use the following from the digital object record in ASpace) and/or the text might have a different colored background.
  • The digital object component would display a default title, comprised of the associated archival object’s title and/or date and a consecutive integer. (In other words, for the archival object ‘Archivematica Series’, the first digital object component would be ‘Archivematica Series 1’, the next would be ‘Archivematica Series 2’ and so forth.)
  • The user would drag one or more files/folders on top of a digital object component. The file(s) and/or folder(s) would be nested under the digital object component. The following example has two digital object components:
  • The user can select a digital object component and click the ‘Edit Metadata’ button. This would permit the user to edit the only pieces of metadata required for digital object components, ‘title’ and/or ‘label’, as seen below in ASpace:
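
To make the default-title rule from the list above a bit more concrete, here's a minimal sketch. The function name and parameters are hypothetical (nothing like this exists in Archivematica or ASpace as far as this post is concerned); it simply joins the archival object's title and/or date with a consecutive integer.

```python
from typing import Optional


def default_component_title(title: Optional[str], position: int,
                            date: Optional[str] = None) -> str:
    """Build a default title for the Nth digital object component.

    Hypothetical helper: `position` is the 1-based order of the component
    under its archival object; `title` and `date` come from that archival
    object, and either one may be missing.
    """
    parts = [part for part in (title, date) if part]
    parts.append(str(position))
    return " ".join(parts)
```

For the archival object 'Archivematica Series', positions 1 and 2 would give 'Archivematica Series 1' and 'Archivematica Series 2', matching the example in the list.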

We've also thought about some simple rules for digital object components (and information packages).  Once an archivist clicks the 'Finalize Arrangement' button, Archivematica will create a SIP for the materials associated with a given archival object and commence its Ingest procedures, which may result in the creation of preservation copies (or OCR text).  Based upon this:
  • If there is only one file, it will be deposited to DSpace as an individual bitstream.
  • If there is more than one file and/or a folder (including derivatives produced by Archivematica), everything in the digital object component will be included in a single .zip file (perhaps using the digital object component title) that will be deposited to DSpace.
  • Additional components of the AIP produced by Archivematica (the logs folder, metadata folder, and METS file) will be packaged in a .zip file and deposited as an additional digital object component (perhaps with some default file name). The Bentley would want this content to be inaccessible to the general public (and ‘not published’ within the ASpace digital object record).
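
As a rough sketch of how the three rules above might play out, consider the following. Everything here is an assumption made for illustration: the function names, the convention that entries ending in "/" are folders, and the default name for the logs/metadata package are not part of Archivematica, ASpace, or DSpace.

```python
from typing import Dict, List

# Hypothetical default name for the .zip holding the logs folder,
# metadata folder, and METS file.
AIP_EXTRAS_NAME = "aip_logs_and_metadata.zip"


def plan_component(title: str, contents: List[str]) -> str:
    """Return the name of the bitstream one component would produce.

    Assumption: entries ending in "/" are folders; `contents` should also
    list any derivatives that Archivematica produced during Ingest.
    """
    has_folder = any(entry.endswith("/") for entry in contents)
    if len(contents) == 1 and not has_folder:
        return contents[0]      # a single file is deposited as-is
    return f"{title}.zip"       # anything else goes into one .zip


def plan_item(components: Dict[str, List[str]]) -> List[dict]:
    """Plan the bitstreams for one DSpace item (i.e., one SIP/AIP)."""
    plan = [
        {"bitstream": plan_component(title, contents), "public": True}
        for title, contents in components.items()
    ]
    # The logs/metadata package is always added and kept out of public view.
    plan.append({"bitstream": AIP_EXTRAS_NAME, "public": False})
    return plan
```

So an item with one single-file component and one multi-file component would end up with three bitstreams: the file itself, a .zip named after the second component, and the restricted logs/metadata .zip.
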
After Ingest processes are complete and the content has been deposited to DSpace, information will be written back to the ASpace digital object record.  The main (i.e., 'top level') digital object would by default inherit the title and/or date of the associated archival object, employ the DSpace handle for File URI (as well as identifier? TBD…), and have an extent (in bytes) that represents all associated content.  PREMIS rights information could also be written to the digital object record, though we'd love to hear from folks with thoughts about this (for instance, would the associated archival object be a more suitable location?).

The digital object components (i.e., each specific grouping of content as well as the Archivematica logs and metadata) would then be added as children of the main digital object record: 

 
The digital object component records might also include extent information, more specific rights information, or...???
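
For the sake of discussion, here's one way the information written back to ASpace might be modeled. This is a simplified sketch and deliberately not the ASpace API schema: the key names, the function, and the idea of nesting components inside the parent record are all assumptions made to keep the example short (in ASpace itself, components are separate records linked to the digital object).

```python
from typing import Dict, FrozenSet


def build_digital_object(ao_title: str, handle_url: str,
                         component_bytes: Dict[str, int],
                         unpublished: FrozenSet[str] = frozenset()) -> dict:
    """Assemble an illustrative top-level digital object with its components.

    Hypothetical helper: `component_bytes` maps each component title to its
    size in bytes; titles listed in `unpublished` are flagged as not published.
    """
    components = [
        {"title": title, "extent_bytes": size, "publish": title not in unpublished}
        for title, size in component_bytes.items()
    ]
    return {
        "title": ao_title,            # inherited from the archival object
        "file_uri": handle_url,       # the DSpace handle
        "identifier": handle_url,     # still TBD per the discussion above
        "extent_bytes": sum(component_bytes.values()),  # all associated content
        "components": components,
    }
```

Calling this with the archival object's title, the item's handle, and the sizes of each grouping (plus the logs/metadata package marked as unpublished) would yield a parent record whose extent is the sum of its children.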

It's been exciting to think about the possibilities of ASpace's digital object record, but the fairly wide-open nature of the endeavor is also daunting, as there are no established best practices to fall back on.  What do you think?  How are you proceeding (or how would you proceed)?  We'd love to get your feedback and/or reactions!