Among the earliest and most consistent academic advocates for the digital revolution have been scholars in the Humanities. The latest example is the Persian Digital Studies program at the University of Maryland’s Roshan Institute for Persian Studies. As the largest institute for Persian Studies in the U.S., the Roshan Institute has been leading its field in computational and digital analysis of the Persian literary tradition, focusing on areas such as digital preservation and the creation of an infrastructure to house digital texts. Original texts are scattered in libraries and private collections around the world. Researchers from the Institute recently made trips to Afghanistan and India, where they discovered vast troves of documents that had been painstakingly preserved but are in danger of destruction. The Institute’s work to build partnerships with collectors and institutes to preserve and digitize these collections has led to additional multinational projects, such as the Open Islamicate Texts Initiative (OpenITI), which brings together institutions such as the Aga Khan University in London, Leipzig University in Germany, and others.
Computational text analysis also offers new ways of looking at old and stubborn scholarly questions. Digitized texts, for example, allow comparisons across broad ranges of literature spanning many centuries, comparisons that a single researcher, or even a team of researchers, would simply not be able to carry out by hand. The work remains painstaking, however, requiring a rare blend of technological expertise and traditional learning to produce, for example, reliable optical character recognition (OCR) for Arabic-script texts.
Matthew Miller, who leads the Initiative in Persian Digital Humanities at the Roshan Institute, where he is Associate Director and Research Fellow, talked to us about the work the program is doing and what it hopes to accomplish.
Tell us a little about yourself.
I am a scholar of Persian literature and culture. My work focuses in particular on medieval Persian Sufi poetry, gender and sexuality studies, and Persian digital humanities.
What is the Roshan Institute and what are some of its goals?
The Roshan Institute for Persian Studies is one of the premier centers for Persian Studies in the world. It aims to increase knowledge and understanding of Persian culture, in all of its historical and contemporary forms, through research, education, and public programming.
How did the Persian Digital Studies Program evolve and what does it hope to achieve?
The Roshan Initiative in Persian Digital Humanities (PersDig@UMD) was started in the fall of 2015 after a one-year Persian Digital Humanities Working Group at UMD recommended its formalization as an official initiative of the Roshan Institute for Persian Studies. It aims to drive the development of Persian Digital Humanities research through the construction of digital infrastructure for Persian Studies and the promotion of Persian DH research. The PersDig@UMD initiative has started three separate projects: the Lalehzar Digital Project, the Persian Manuscript Initiative (PMI), and the Open Islamicate Texts Initiative (OpenITI).
To briefly summarize, the Persian Manuscript Initiative focuses on the preservation of Persian manuscripts throughout the world and on building the digital infrastructure to bring the field of Persian manuscript studies into the digital age. The Lalehzar Digital Project is likewise a digital preservation effort, but it concentrates on the cultural production of Lalehzar Street (in Tehran, the "Champs-Élysées of Iran"). Finally, the OpenITI project is working on building the first open-access, machine-actionable, and scholarly-verified Perso-Arabic corpus focused on the pre-modern period.
Who are some of the partner institutions and what collaboration exists between UMD and them?
In the summer of 2016 we officially signed an agreement with the Library of Congress to do a “knowledge-sourcing” initiative on their Persian manuscript collection, and we have partnered with the Hill Museum and Manuscript Library (HMML, a non-profit institute dedicated to the global preservation of ancient manuscripts) to display and annotate the digital manuscripts produced in this project through their new vHMML Reading Room. No manuscripts are available yet for public viewing. We do expect, however, to release the first twenty Library of Congress Persian manuscripts through vHMML by the summer of 2017.
Secondly, with the help of The Islamic Manuscript Association (at Cambridge University) and the Roshan Institute for Persian Studies, and in collaboration with the HMML, we have been working on a preservation, cataloging, and (we hope eventually) digitization program with the National Archive of Afghanistan. In April of 2016 we sent a site visit team there, led by the renowned scholar of Persian manuscripts, Dr. Francis Richard, and Noshad Rokni (Malek Museum). They did a complete preliminary curatorial evaluation of the archive's manuscript collection and ran several educational workshops on codicology and preservation for the archive staff. The preliminary report on the first phase of our project with the National Archive of Afghanistan can be found here: http://www.persianmanuscript.org/blog/2016/10/17/pmi-releases-status-report-on-the-nation-archives-of-afghanistan.
In the fall of 2016 we hosted an event called the “Manuscripts in the Digital Age Workshop.” This workshop brought together technical specialists from the Perseus Digital Library, Digital Latin Library, Schoenberg Institute for Manuscript Studies (UPenn), Omeka, and the Maryland Institute for Technology in the Humanities (MITH) to discuss the technical challenges of creating an online "manuscript workspace." You can learn more about this workshop here: http://www.persianmanuscript.org/blog/2016/10/18/nov-4th-5th-manuscripts-in-the-21st-century-workshop
Finally, we just commenced a project with the Raja of Mahmudabad Palace Library in Lucknow, India. This project is done in collaboration with our digitization partner, the Hill Museum and Manuscript Library, and its goal is to digitally and physically preserve the collection there, which is one of the most important privately held Perso-Arabic manuscript collections in India. You can learn more about this project here: http://www.persianmanuscript.org/blog/2017/4/21/new-project-announcement-digitizing-the-raja-of-mahmudabad-palace-library-manuscript-collection
What do you hope the program will accomplish?
First and foremost we need a corpus of classical Persian texts that contains scholarly-verified, metadata-enriched, open-access, and machine-actionable texts. I cannot emphasize this enough; it is really a daunting task which will require a great deal of time and resources to develop up to scholarly standards. And, equally important, this corpus needs to include more than just poetry (which is the vast majority of the texts that are currently available as open-access texts). The computational study of Persian literature will not develop without this basic component. We are way behind Arabic Digital Humanities in this respect. We have approximately 10 million words of Persian in open-access texts whereas Arabic has over 1 billion words!
Secondly, we are working on Persian OCR technology because we need to increase the quantity, quality, and representativeness of open-access digital Persian texts. We have spent most of the last 6 months working on the problem of OCR for Arabic-script languages (such as Persian) and, again, preliminary results (done in collaboration with Maxim Romanov and Benjamin Kiessling of Leipzig University and Sarah Bowen Savant of Aga Khan University, London) can be seen here: http://www.academia.edu/28923960/Important_New_Developments_in_Arabograp....
We are currently working on an OCR "pipeline" that will allow users to post-correct the texts they run through our OCR program and submit additional training data so we can continue to improve the accuracy of our OCR. The development of open-access Persian OCR is critical for the advancement of the field: it will allow us to dramatically expand the number and quality of digital texts that we can study.
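As an illustration only, and not the project's own code, the following Python sketch shows one common way to quantify how much post-correction a page of OCR output needed: the character error rate (CER), computed as the edit distance between the raw OCR text and the user-corrected text, divided by the length of the corrected text. The function names `levenshtein` and `character_error_rate` are hypothetical, and the choice of Unicode NFC normalization before comparison is an assumption made here so that equivalent encodings of the same Arabic-script character are not counted as errors.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Hypothetical helper: CER of raw OCR output against a human-corrected reference."""
    # Assumption: normalize both strings to NFC so canonically equivalent
    # sequences compare as equal.
    ocr_norm = unicodedata.normalize("NFC", ocr_text)
    ref_norm = unicodedata.normalize("NFC", corrected_text)
    if not ref_norm:
        raise ValueError("corrected_text must not be empty")
    return levenshtein(ocr_norm, ref_norm) / len(ref_norm)
```

For example, if the OCR output reads "کناب" and the corrected text is "کتاب" (one wrong letter out of four), `character_error_rate` returns 0.25. Pairs of raw and corrected text like these are also the kind of material that could be fed back as additional training data.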
While improving and enriching the existing open-access digital texts and building an open-access Persian OCR pipeline are critical, we also need to work, first, on the digital preservation of Persian manuscripts and other material culture objects and, second, on making the objects that have been digitized open-access and compliant with international standards (such as the new International Image Interoperability Framework, IIIF) so that they can be widely shared. There is this stubborn notion that if museums, archives, or universities make their digital collections open-access, that will devalue them. This is patently false. The studies that have been done on this question have convincingly shown that institutions that make their digital collections openly available on the web actually see increased traffic and interest in their collections. It seems counter-intuitive, but just think: what do people go to see when they visit the Louvre in Paris? They go to see the Mona Lisa. Why? Because they have seen reproductions of it a million times in their lives. It is famous because it is everywhere, not because it is locked away in a museum. We need to learn from this and work both to digitize Persian manuscripts, art, historical documents, etc. and to make them freely available on the web. For our part, we are working to digitize Persian manuscripts through our collaboration with the Hill Museum and Manuscript Library, one of the world leaders in cultural preservation and digitization. As mentioned above, we are currently working on Persian manuscript digitization projects in both Afghanistan and India, and we are actively looking for other projects.
Why do you think it is important?
We live in a digital age, for better or for worse. Other humanities fields have made the jump into the digital age by creating digital corpora, digital repositories, etc. The field of Persian Studies needs to do this as well. What we hope we can do in the PersDig@UMD initiative is to help catalyze that transition by working to create components of the requisite digital infrastructure, such as machine-actionable corpora, Perso-Arabic script OCR, repositories of digital manuscripts, etc.