The Basel Avisblatt (1834), p. 53.
After Basel University Library digitized the Avisblatt (6,625 issues from 1729 to 1844, with a total of 47,495 pages and around 750,000 adverts), the digital collection was uploaded to freizo, an online database and working environment developed by data futures. Every year and every issue is now available as a IIIF manifest, and the collection will be published online by 2020. For annotation, we extended the annotation mask integrated in Mirador so that individual ads can be categorized, building up a database.
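As a rough illustration of what the IIIF manifests make possible, the following sketch lists the page images of one issue. It assumes the manifests follow the IIIF Presentation API 2.x layout (sequences, canvases, images); the sample manifest, its labels and its URLs are invented placeholders, not actual freizo identifiers.

```python
import json

# Hypothetical, heavily shortened manifest in IIIF Presentation 2.x layout.
# All identifiers below are placeholders, not real freizo URLs.
SAMPLE_MANIFEST = json.loads("""
{
  "@type": "sc:Manifest",
  "label": "Avisblatt issue (placeholder)",
  "sequences": [{
    "canvases": [
      {"label": "p. 1", "width": 2000, "height": 3000,
       "images": [{"resource": {"@id": "https://example.org/iiif/page1/full/full/0/default.jpg"}}]},
      {"label": "p. 2", "width": 2000, "height": 3000,
       "images": [{"resource": {"@id": "https://example.org/iiif/page2/full/full/0/default.jpg"}}]}
    ]
  }]
}
""")


def list_pages(manifest):
    """Return (label, image URL) pairs for every canvas in the manifest."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            images = canvas.get("images", [])
            if not images:
                continue  # skip canvases that carry no image
            pages.append((canvas.get("label", ""), images[0]["resource"]["@id"]))
    return pages


for label, url in list_pages(SAMPLE_MANIFEST):
    print(label, url)
```

A manifest in Presentation API 3.0 layout would need a different traversal (it nests canvases under `items`), so the version in use should be checked first.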
To generate full text, we trained two HTR+ models in Transkribus, one for the Gothic print before 1800 and one for the print after 1800. The Character Error Rate is below 1%, so the recognized text is highly accurate. Because spelling is not standardized, text mining will have to deal with irregularities and build vocabularies that could also be useful for other sources from the same period and region.
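For readers unfamiliar with the metric, the Character Error Rate is commonly computed as the edit distance between the recognized text and a manually corrected reference, divided by the length of the reference. The sketch below shows that calculation in plain Python; it is a generic illustration, not the evaluation code used by Transkribus, and the function names are our own.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example strings: one wrong character out of four.
print(character_error_rate("abcd", "abxd"))
```

A CER below 1% therefore means fewer than one character edit per hundred reference characters on average.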
The layout of the Avisblatt changes over the years, switching between one, two and three columns, with multi-column or multi-page ads and special layouts for price lists, annual indices, etc. We are currently training a layout model with dhSegment that aims to combine automated layout recognition with post-processing. The resulting coordinates for individual adverts are fed into Transkribus, which generates full-text PAGE XML at advert level.
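To make the hand-over of coordinates more concrete, here is a minimal sketch that turns rectangular advert boxes into PAGE XML text regions. It assumes the PAGE 2013-07-15 namespace and simple axis-aligned boxes; the function name, region ids and file name are hypothetical, and the `Metadata` element that the schema requires is left out for brevity, so the output is not schema-valid as it stands.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2013-07-15 schema (an assumption; check the version in use).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def boxes_to_page_xml(image_filename, width, height, boxes):
    """Build a minimal PAGE XML string with one TextRegion per advert box.

    boxes is a list of (x_min, y_min, x_max, y_max) tuples in pixels.
    """
    ET.register_namespace("", PAGE_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(root, f"{{{PAGE_NS}}}Page", {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    for index, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Hypothetical id scheme: ad_1, ad_2, ...
        region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", {"id": f"ad_{index}"})
        # Corner points clockwise from the top-left corner.
        points = f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"
        ET.SubElement(region, f"{{{PAGE_NS}}}Coords", {"points": points})
    return ET.tostring(root, encoding="unicode")


print(boxes_to_page_xml("page.jpg", 2000, 3000, [(10, 20, 110, 220)]))
```

Real adverts that span columns or pages would need polygons or several linked regions instead of a single rectangle.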
Every ad is annotated and classified at a first level according to content, type and transaction mode. After this first annotation step, the researchers in their subprojects can filter subsets and, with adapted masks, classify further according to their research interests. So far, classification is done manually. Once five years of the Avisblatt have been annotated, the classified advertisements will be used as training data for machine learning, to enable first-level classification of all ads in the whole corpus. We will keep you up to date!
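The idea of learning from manually classified ads can be sketched with a very small text classifier. The example below uses a multinomial Naive Bayes model with add-one smoothing, chosen purely for illustration; it is not necessarily the method the project will adopt. The training texts and the labels `offer` and `demand` are invented placeholders, not real Avisblatt adverts or the project's actual categories.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data standing in for manually annotated ads.
TRAIN = [
    ("for sale a good oak table", "offer"),
    ("for sale two beds and a mirror", "offer"),
    ("wanted to buy a used stove", "demand"),
    ("wanted a room to rent", "demand"),
]
model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
print(model.predict("wanted a small table"))
```

With historical, unstandardized spelling, the tokenization step would likely need normalization or character-level features before a model like this becomes useful.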