Big Data, DH, Gender: Silence in the Archives?

In January, February, and March 2012, a lot of ideas and scrambling began that will eventually culminate in panels at the 2013 MLA Convention in Boston. I helped (using that word generously) with a few panels put forth by MLA discussion groups (to which I’ve been elected) including the panel sponsored by the Discussion Group on Computers in Languages and Literature and another panel sponsored by the Discussion Group on Bibliography and Textual Studies.

Alan Galey, the organizer for the Bibliography & Textual Studies panel, organized me right onto that panel about Digital Archives and Their Margins where I will talk about some of the issues outlined below:

After commenting on Ted Underwood’s tremendous undertaking, reading Miriam Posner’s blog post, “Some things to think about before you exhort everyone to code,” and reading the really interesting (and enormous) set of comments by the DH community on both posts, I was moved to tweet about a recent data set. Romanticism and Victorianism on the Net came out with its latest journal edition which includes an interesting article about big data, aesthetics and the long 18th century in literature. (Yep, as a Romanticist, I too bristle that some fields persist in trying to subsume Romantic-era literature into the long 18th or 19th centuries…but that’s another story about administrative politics in underfunded departments.)

Updated to add: I’ve also been engaged in some fairly exciting conversations with Jacque Wernimont (see her post on Feminism and Digital Humanities) about all of these topics (and I would argue, her scholarship has pushed me further in my own areas).

All this has got me thinking about that MLA panel for 2013 and returning to a topic that’s close to the Society for Textual Scholarship 2011 conference panels on feminism, textual studies, and Digital Humanities where nothing was resolved except a stark articulation of gender differences in all of these fields.

As is the thing to do, I sent a concerned tweet which Roger Whitson immediately picked up, and that led to an engaging and interesting conversation along with Lauren Klein. Natalia Cecire storified the entire conversation for us, “From Archival Silence to Glorious Data.” And, possibly, an MLA panel has been born about text mining and textual criticism. Lauren is working on a longer project about this very topic, archival silence and topic modeling.

Roger suggested (post-storify) that

my hope is to be able to find a way to express that absence algorithmically. But I’m being utopian.

The conversation left me with an even larger question (post-storify):

Were British 19th C women more prone to publish in single-author writings or as part of a newspaper, magazine, journal, anthology, etc.?

Off the top of my book history hat, I can’t think of a scholarly project that answers this question definitively. We have case studies, but this explosion of accessibility and data mining has proved that some of these case studies may not be all that universal.

Print culture exploded in the early 19th Century. There are so many documents and texts to digitize that it’s become the job of libraries, who have deeper pockets than some, to curate these collections. And, now Google Books, ECCO, and HathiTrust have become the custodians who also perform the labor of digitization and mark-up. (There are issues with corporations taking over cultural materials, but that’s another topic.) These smaller digital projects, the ones that are usually full of that ephemeral stuff by the non-canonical people, typically languish at this stage, that digitizing and mark-up stage, because the individual whose passion fuels the project has lost some institutional support or funding.

…aaannnndddd, now we come full circle to the conversation about professionalization and what counts — hence my post on doing the risky thing with this gothic stuff.

Nevertheless, the big data sets that are in play in this conversation (on Twitter and here in this post) in both projects are ones that were created by other institutions. If the traditionally marginalized authors are marginalized now because it’s no longer sexy or innovative to digitize and mark up those collections, then how far have we really come? Are those recovery projects then marginalized because they bring nothing innovative to Digital Humanities?

[Caveat: This claim, let’s be clear, is not based on funding figures from the NEH Office of Digital Humanities. That would be an interesting set of numbers to crunch, though: how often those under-represented peoples are the topic of funded digital projects. And, to be fair, again, other departments in the NEH are now funding digital projects. Can we obtain those numbers, too, to discover if, say, the Scholarly Editions NEH grants are being awarded to projects that create scholarly editions of a marginalized set, and whether the funding would primarily support the huge labor of digitizing and mark-up? I write this as my Beard-Stair students struggle with the next step of the project, creating a digital representation of their work with an out-of-the-box platform.]

Book historians and print culture scholars seem uniquely positioned to answer some of these questions because of their propensity for doing big projects (I mean, lifelong career projects) that cover wide swaths of literary history and culture. Lisa Maruca threw down the gauntlet today to book historians to take up this cause about silence in the archives. Ok, I’ll do it. Have been for a while. Will be for a very, very long time. Most of my opinions about Digital Humanities come out of work in print culture and book history, especially on the British literary annuals: some 3000 volumes of poetry, prose, translations, travel narratives, landscape engravings, portraits, women authors, Romantic/Victorian authors, editors, publishers 1823-1860. There’s a treasure trove of materials just waiting, waiting, to be digitized and then mined. But, we’ve never gotten enough money to fund the laborious digitizing and mark-up that’s required. [sigh]

So, how about it, other print culture and book historian types? How about you? Ready to take up this cause?

But, we’re not ready to stop talking about Big Data. Before I could even finish this post, Ted responded via Twitter:

Re: representativeness of collections. I think humanists are still implicitly thinking about canonicity, which is a zero-sum game.

Collection-building is not a zero-sum game. I build collection X, you build Y, she builds Z, we all go public. Then other scholars can

select whatever subset seems to them “representative.” We’re used to assuming there must be a conflict here — but I don’t see one.

Every single book you digitize / normalize / mark up is good news for me, even if you have a view of “representativeness” I don’t share.

Fair enough. Ted’s project is to use the corpus that is available. He’s moved his project beyond that labor of creating the digital representatives and is working on the humanistic queries that are so engaging (to me) in Digital Humanities. To be fair, Ted has promised to return to his data set to add more women authors from the Brown Women Writers Project and such. He gets into some of this in a longer response to representation, big data, and the canon.

But, I return to this ethos: do we have a responsibility to acknowledge the lack, this silence in the archive?

…to be continued at MLA 13….or in the comments below.

(Just a note about scholarly engagement: This is why I really enjoy blogging, tweeting, and other forms of more immediate writing: a conversation begun with Ted’s original post a few days ago has become multiple posts for multiple types of scholars who are commenting in real time on these very cogent, and sometimes urgent, questions.)

Update 3/3/12 10:12pm: Both Jacque Wernimont and Roger Whitson respond to these and related questions circling around the DH in this post. Michael Kramer’s post on DH as Process got me thinking about platforms for exposing digital projects to the world. And the conversation among librarians and archivists over at Kate Theimer’s blog resulted in lengthy comments that indicate a division between scholars and archivists: “The problem with the scholar as ‘archivist,’ or is there a problem?”

[This and a whole series of blogs on digital feminism wound up as an Editors’ Choice for Digital Humanities Now, March 5, 2012 in addition to sparking some interesting doings with THATCamp, articles, contributions]

Update 1/23/16: This discussion is cited in Susan Brown’s article “Networking Feminist Literary History: Recovering Eliza Meteyard’s Web” in Virtual Victorians: Networks, Connections, and Technologies, eds. Veronica Alfano and Andrew Stauffer, Palgrave (2015). Incredible that this discussion lives now through several mediums: Twitter, blog posts, storify, printed article in an edited collection.

Digital Humanities, Editing, feminism, literary annuals, romanticism

digital humanities, editing, feminism, gothic, literary annuals, romanticism

March 3, 2012

18 Comments

Roger Whitson (@rogerwhitson)

Great post. My sense is that this issue of representation (coming off of Ted Underwood’s set of tweets) is really an issue about Linked Open Data. I’m increasingly of the opinion that questions of representation are really about questions of access to datasets. If we work w/in Google’s world, we’ll have to submit to the reality they present to us, and the assumptions made when constructing their archives. If we work to make datasets that are linked and open, we can create the kinds of projects that highlight the role that non-canonical ephemera (and race, gender, class, sexuality) had in constructing literary culture.

March 3, 2012 12:02 pm

tedunderwood

I feel a little bad commenting, because it’s already clear in Katherine’s post that I’ve been talking sort of nonstop all week 🙂

But thanks, Katherine: it’s an important topic, and I’m looking forward to the MLA panels you describe.

I agree also with your point about Linked Open Data, Roger … except that I’m a little apprehensive that DH will make the perfect the enemy of the good. Like TEI, Linked Open Data sounds to me like a great, important long-term initiative. But what I feel we need *right now* are moderately clean plain-text files accompanied by author, title, and date of publication … and whatever other kinds of metadata individual researchers want to add.

There’s a heck of a lot we could do with that sort of collection, and it’s an immediately achievable goal rather than something that requires endless focus-grouping. I also feel confident that data organized this way can support inquiry into race, gender, class, and sexuality. The data I got from TCP didn’t include anything about genre, author’s gender, or national origin, but we added that metadata by hand where we needed it. The hard part is getting a moderately clean text in the first place. At the scale we’re dealing with — which is in reality only a few thousand texts — social metadata can be added by individual researchers in a couple of afternoons.

March 3, 2012 1:40 pm

tedunderwood

Oh, and just to clarify: I haven’t given up the work of collection-building. Working together, Jordan Sellers and I built the 19c part of our collection (1600 vols) over the last nine months. I’m not digitizing ephemera, to be sure: I’m working with what’s already in the Internet Archive. But my feeling is that there’s already a lot in that archive, and what we really need at the moment is the kind of minimal processing that turns dirty OCR into a semi-consistent, imperfect but usable collection.

March 3, 2012 1:47 pm

triproftri

Thanks for your response, Ted. Where I differ from some doing “big data” is that I want to then collect those “collections” into an archive that’s edited by some scholar. I have the scholarly editor’s eye towards DH projects. With big data, even that shared data that we really should be participating in, there’s an implicit demand that the experiments become empirical. A scholarly edition/archive/collection/arsenal/knowledge-base demands that the data be exposed, organized. Then the scholarly editors of late are doing something: opening up that collection for crowd-sourced information so that the texts become stable enough for everyone to provide further information.

This is just to say….

the big data should be left alone
in the fridge
for others to gaze upon
and then eat later.

I don’t want the data as a set to disappear. I want it to appear and become re-mixable. That social digital edition akin to the Devonshire Manuscript Project (http://earlymodernonlinebib.wordpress.com/2012/02/29/the-devonshire-manuscript-a-digital-social-edition/). But, I suspect that we come from two different camps. Ultimately, I need a platform to house all of this display and demonstration. Then, I want to mine the texts for interesting turns.

Where do data miners meet scholarly editors?

March 3, 2012 2:55 pm
- tedunderwood
  
  Probably at 18thConnect and NINES! Which is to say that I do understand what you’re saying, and I admit that is probably the best medium-term approach. With crowd-sourcing tools of the kind 18thConnect is developing, we can gradually refine the texts so that they become more reliable, and there’s also a platform there to permit search & display. And, with a bit of luck, it ought to be possible to build certain kinds of text-mining facilities into those sites as well. In short, I agree.
  
  But I’m also not very patient … so this is just to say / that I’m going to eat the data / while it is in the icebox / because it seems like a long time till breakfast.
  
  March 3, 2012 3:17 pm

jwernimont

In the context of Ted’s suggestion that canonicity isn’t or shouldn’t be the game we’re playing, I find myself wondering about who all of this “big data” is serving and how the implied volumes of archives make their way into classrooms and popular culture. I’m thinking of some recent work by Alex Juhasz on the Women’s Building in LA. Writing in GLQ 17.4 she describes being overwhelmed by the sheer volume of the video and material archive. Juhasz asks: “What and who is this tape for? Why was it archived?” She goes on: “so moved were they by their own present that it seems they planned for a future littered with the documents that they needed then.” Juhasz is pointing to the ways in which an archive responds to its moment of construction, in this case to the needs of the women at the Women’s Building to record their lives, to mark their processes as worthy of recording after so many years of exclusion. But as her experience of that archive just one generation later attests – that archive is both a failure and a success. Successful as a record of the feminist process, as a record of the archival desires and needs in that moment. A failure in its sheer volume – something that would take a lifetime to process let alone make sense of. The question then is what do *we* as users now do with a “littered” archive – what do we do with a superabundance of data? I find myself wondering if there aren’t parallels – are we really archiving for ourselves? If we are, is that a problem? John Guillory’s work did such a nice job of demonstrating how deeply intertwined notions of canon, curriculum, and cultural capital are that I have a hard time thinking that we can really just dispose of the issue of canon and celebrate all of this great open space.

March 3, 2012 7:51 pm

triproftri

Thank you for the very detailed and cogent response, Jacque. My point of entry here is using the term “archive” as interchangeable with big data. With digital archives, we’ve moved beyond thinking about them as a messy heap. In fact, many librarians/archivists argue that the archive is already implicitly theorized and edited based on the original impetus. And, Derridean moment here, the archon contaminates as much as it saves. The problem is that the archon is not always aware of that contamination. (David Greetham talks about this in _Contamination_). So, later users are overwhelmed, but I would ask if this is something new? The Downtown Archives at Fales Library were a messy heap of objects, manuscripts, tearings, mummies (yes!), printed books — and some of the collections were organized already by the artists/authors. The curator, Marvin Taylor, had a hell of a job figuring out a structure for organizing the materials — and then he had to tackle the metadata for such a project.

Perhaps digital archives can offer entry points into the material that are more bite sized for the public and then re-mixable for deeper study? Martha Nell Smith, of course, has been talking about this kind of dynamic archive for a couple of decades. But, we haven’t yet created a platform that allows for all kinds of re-mixing and re-using. And, this is why (in my opinion) digital archives haven’t infiltrated as much into the classroom. A semester is so incredibly rushed that it’s difficult to get into digital resources unless there’s much more in-depth planning by the instructor. But, what if we began using those students, that classroom, those publics to crowd-source information? Wouldn’t that draw them in?

And how responsible are archives for anticipating future questions about the material they hold?

As with *all* of my scholarly endeavors, I only have questions, not answers. But they sure are a fun group of questions.

March 5, 2012 10:38 am