This paper investigates the effect of including a parser network, which produces syntactic heights and distances to perform unsupervised parsing, in the Every Layer Counts BERT (ELC-BERT) architecture trained on 10M tokens for the 2024 BabyLM challenge. Including the parser network in this setup yields little to no improvement over the ELC-BERT baseline on the BLiMP and GLUE evaluations, but in particular domains of the EWoK evaluation framework it shows promise and raises interesting questions about its effect on how different concepts are learned.
@inproceedings{BehrBabyLM,
  title     = {ELC-ParserBERT: Low-Resource Language Modeling Utilizing a Parser Network With ELC-BERT},
  author    = {Behr, Rufus},
  editor    = {Hu, Michael Y. and Mueller, Aaron and Ross, Candace and Williams, Adina and Linzen, Tal and Zhuang, Chengxu and Choshen, Leshem and Cotterell, Ryan and Warstadt, Alex and Wilcox, Ethan Gotlieb},
  booktitle = {The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning},
  month     = nov,
  year      = {2024},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.conll-babylm.11/},
  pages     = {140--146},
  dimensions = {true},
  google_scholar_id = {9yKSN-GCB0IC},
}
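For readers unfamiliar with this kind of component, the sketch below illustrates what a parser network predicting per-token syntactic heights and per-adjacent-pair syntactic distances might look like in PyTorch, in the style of StructFormer-like unsupervised parsers. All class and variable names here are hypothetical, and the paper's actual architecture may differ; this is a minimal sketch, not the submitted model.

import torch
import torch.nn as nn

class ParserNetwork(nn.Module):
    """Hypothetical sketch of a parser network: a convolution over token
    embeddings feeds two small heads, one predicting a syntactic height
    per token and one predicting a syntactic distance per adjacent token
    pair. The heights/distances could then be turned into soft attention
    constraints for the transformer layers (not shown here)."""

    def __init__(self, hidden_size: int, conv_size: int = 9):
        super().__init__()
        # Same-length convolution over the sequence dimension.
        self.conv = nn.Conv1d(hidden_size, hidden_size,
                              kernel_size=conv_size,
                              padding=conv_size // 2)
        self.height_head = nn.Linear(hidden_size, 1)         # one height per token
        self.distance_head = nn.Linear(2 * hidden_size, 1)   # one distance per adjacent pair

    def forward(self, embeddings: torch.Tensor):
        # embeddings: (batch, seq_len, hidden)
        x = self.conv(embeddings.transpose(1, 2)).transpose(1, 2)
        heights = self.height_head(x).squeeze(-1)            # (batch, seq_len)
        pairs = torch.cat([x[:, :-1], x[:, 1:]], dim=-1)     # adjacent token pairs
        distances = self.distance_head(pairs).squeeze(-1)    # (batch, seq_len - 1)
        return heights, distances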
This paper describes the system submitted to EvaLatin 2024's shared dependency parsing task, part of the LT4HALA 2024 workshop. EvaLatin provided new Latin prose and poetry dependency test data from potentially different time periods and imposed no restrictions on training data or model selection. This paper therefore sought to build a general Latin dependency parser that performs accurately regardless of the Latin age to which the test data belongs. To train such a general parser, all of the available Latin Universal Dependencies treebanks were used, and to address changes in the Latin language over time, this paper introduces historical sentence embeddings: a model is trained to encode sentences of the same Latin age into vectors of high cosine similarity. The system feeds these historical sentence embeddings into a biaffine dependency parser in the hope of making training across the Latin treebanks more effective, but their inclusion shows no improvement over the base model.
@inproceedings{BehrEvaLatin,
  title     = {Behr at EvaLatin 2024: Latin Dependency Parsing Using Historical Sentence Embeddings},
  author    = {Behr, Rufus},
  editor    = {Sprugnoli, Rachele and Passarotti, Marco},
  booktitle = {Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024},
  month     = may,
  year      = {2024},
  publisher = {ELRA and ICCL},
  url       = {https://aclanthology.org/2024.lt4hala-1.22},
  pages     = {198--202},
  dimensions = {true},
  google_scholar_id = {d1gkVwhDpl0C},
}
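As a rough illustration of the historical sentence embedding objective described above, the PyTorch sketch below applies a cosine embedding loss over sentence pairs so that sentences from the same Latin age are pulled toward high cosine similarity and sentences from different ages are pushed apart. The encoder, pairing scheme, and all names are hypothetical; this only approximates the kind of training signal the paper describes.

import torch
import torch.nn.functional as F

def historical_embedding_loss(emb_a: torch.Tensor,
                              emb_b: torch.Tensor,
                              same_age: torch.Tensor,
                              margin: float = 0.0) -> torch.Tensor:
    """Cosine embedding loss over sentence pairs.

    emb_a, emb_b: (batch, dim) sentence embeddings from any encoder.
    same_age:     (batch,) with +1 if the two sentences come from the
                  same Latin age, -1 otherwise.
    Same-age pairs are pushed toward cosine similarity 1; different-age
    pairs are pushed below the margin.
    """
    return F.cosine_embedding_loss(emb_a, emb_b, same_age, margin=margin)

# Toy usage: two pairs, the first from the same age, the second not.
emb_a = torch.randn(2, 256)
emb_b = torch.randn(2, 256)
same_age = torch.tensor([1.0, -1.0])
loss = historical_embedding_loss(emb_a, emb_b, same_age)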
2023
School
Undergrad Thesis on Author-Styled Latin Text Generation
Due to the unique challenges presented by low-resource languages compared to their high-resource counterparts, common tasks concerning the style of low-resource language text may require less straightforward approaches. To explore this claim, this paper investigates authorship attribution (also known as author classification), a common stylometric task, in a low-resource language setting, namely Latin. In doing so, it sheds light on features that may prove useful for subsequent stylometric tasks in Latin.