This notebook downloads and processes the gendering data from the Vienna catalog. The data comes from a gendering add-in for Microsoft Word 2010 that was developed by Microsoft. It covers two styles: double notation and inner I (Binnen-I).

Further manual normalization would be needed to make this data fully useful for our project. For example, both the inner-I and the double-notation forms can be derived from the female form alone (together with the male form, which is already given by the replaced word), and entries for the same word in different grammatical cases could be reduced to a single entry.
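
As a rough illustration of that normalization idea, the sketch below derives both styles from a male/female pair. The helper name `derive_styles` and its simple suffix rule are assumptions made for illustration only; many catalog entries (umlauted stems, multi-word titles, inflected cases) would not be covered by it.

```python
from typing import Dict, Optional


def derive_styles(male: str, female: str) -> Dict[str, Optional[str]]:
    """Derive double notation and inner-I forms from a male/female pair.

    Hypothetical helper: it only handles the regular case in which the
    female form is the male form plus the suffix "in" (Lehrer -> Lehrerin).
    """
    # Double notation in the "female bzw. male" order used by the catalog.
    double_notation = f"{female} bzw. {male}"
    # The inner-I form is only derived when plain suffixing applies.
    inner_i = male + "In" if female == male + "in" else None
    return {"double": double_notation, "inner_i": inner_i}


print(derive_styles("Lehrer", "Lehrerin"))
print(derive_styles("Arzt", "Ärztin"))  # irregular stem: no inner-I form derived
```

Returning `None` for irregular pairs keeps the cases that still need manual attention visible instead of guessing a form.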

The data is highly relevant for this project because it was also created in a government context and includes many government-specific words.

import io
import re
import sys
from typing import Dict, List

import pandas as pd
import requests

# The helper modules live in the parent directory.
sys.path.insert(0, "..")
from helpers import add_to_dict, log
from helpers_csv import csvs_to_list, dict_to_csvs

csv = requests.get(
    "https://www.data.gv.at/katalog/dataset/15d6ede8-f128-4fcd-aa3a-4479e828f477/resource/804f6db1-add7-4480-b4d0-e52e61c48534/download/worttabelle.csv"
).content
# Strip the surplus ";;" at the end of each line and normalize line endings.
text = re.sub(";;\r\n", "\n", csv.decode("utf-8"))
df = pd.read_csv(io.StringIO(text))
df.head()
   Laenge  Hauptwort                                          Vorschlag                                          Binnen
0  50      Verantwortlicher für Informationssicherheit (C...  CISO                                               N
1  50      Verantwortlicher für Informationssicherheit (C...  Verantwortliche bzw. Verantwortlicher für Info...  N
2  45      Diplomierte Gesundheits- und Krankenschwester      Diplomiertes Krankenpflegepersonal                 N
3  43      Unabhängiger Bedienstetenschutzbeauftragter        Unabhängige Bedienstetenschutzbeauftragte bzw....  N
4  39      Kontrakt- und Berichtswesenbeauftragter            Kontrakt- und Berichtswesenbeauftragte bzw. -b...  N
df.to_csv(
    "vienna_catalog_raw.csv",
    index=False,
)
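
To make the clean-up step above concrete, here is a minimal sketch that applies the same `re.sub` call to a small inline sample. The sample rows are made up for illustration, and the assumed raw layout (comma-separated fields, each line ending in `;;` plus a Windows line break) is inferred from the substitution pattern rather than checked against the published file.

```python
import io
import re

import pandas as pd

# Hypothetical sample imitating the assumed raw layout of the download.
raw = (
    "Laenge,Hauptwort,Vorschlag,Binnen;;\r\n"
    "10,AHS-Lehrer,AHS-LehrerInnen,Y;;\r\n"
)

# Same substitution as in the notebook: drop the trailing ";;" and
# turn "\r\n" into "\n" so that pandas sees a regular CSV.
text = re.sub(";;\r\n", "\n", raw)
sample = pd.read_csv(io.StringIO(text))

print(sample.columns.tolist())
print(sample.loc[0, "Vorschlag"])
```

Without the substitution, the two semicolons would stay attached to the last field of every row, including the header.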

We convert Binnen-I forms to the gender star so that there is one simple style, and we attribute each entry to singular or plural as well as we can:

dic: Dict[str, Dict[str, List[str]]] = {"sg": {}, "pl": {}}
for (_, _, ungendered, gendered, binnenI) in df.to_records():
    if not isinstance(gendered, str):  # skip ill-formatted rows (e.g. NaN)
        continue
    if binnenI == "Y":
        # Binnen-I -> gender star, e.g. "LehrerIn" -> "Lehrer*in"
        gendered = re.sub(r"([a-zäöüß])I", r"\1*i", gendered)
    if re.search(r"[iI]n( .*)?$", gendered) or re.search(r" bzw\.? ", gendered):
        # "in" at the end of a word, or a "bzw." pair: singular
        add_to_dict(ungendered, [gendered], dic["sg"])
    elif re.search(r"[iI]nnen( .*)?$", gendered) or re.search(r" und ", gendered):
        # "innen" at the end of a word, or an "und" pair: plural
        add_to_dict(ungendered, [gendered], dic["pl"])
    else:
        # no number marker found: file the entry under both
        add_to_dict(ungendered, [gendered], dic["sg"])
        add_to_dict(ungendered, [gendered], dic["pl"])
dict_to_csvs(dic, "vienna_catalog")
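
The two heuristics above can be tried in isolation. The following sketch wraps the same regular expressions in two small functions; the names `to_gender_star` and `guess_number` are hypothetical and not part of the project's helpers, and the test words are chosen for illustration.

```python
import re


def to_gender_star(word: str) -> str:
    # Same substitution as above: a capital I directly after a lowercase
    # letter is treated as a Binnen-I and replaced by "*i".
    return re.sub(r"([a-zäöüß])I", r"\1*i", word)


def guess_number(gendered: str) -> str:
    """Return "sg", "pl" or "both", mirroring the branch order above."""
    if re.search(r"[iI]n( .*)?$", gendered) or re.search(r" bzw\.? ", gendered):
        return "sg"
    if re.search(r"[iI]nnen( .*)?$", gendered) or re.search(r" und ", gendered):
        return "pl"
    return "both"


for form in ["LehrerIn", "LehrerInnen", "AHS-Lehrerin bzw. AHS-Lehrer", "Lehrkräfte"]:
    starred = to_gender_star(form)
    print(starred, guess_number(starred))
```

Because the singular test runs first, any form containing a word that ends in "in" followed by a space is classified as singular even if it is semantically plural, which is one reason the attribution is only approximate.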

We can read these CSVs back into a Python list with the following helper:

list_ = csvs_to_list("vienna_catalog")
list_[:5]
[['AHS-Lehrer', 'AHS-Lehrer*innen', '1'],
 ['AHS-Lehrer', 'AHS-Lehrerin bzw. AHS-Lehrer', '0'],
 ['AHS-Lehrer', 'AHS-Lehrerinnen und AHS-Lehrer', '1'],
 ['AHS-Lehrern', 'AHS-Lehrerinnen und AHS-Lehrern', '1'],
 ['Abfallmanager', 'Abfallmanager*innen', '1']]
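
If a lookup structure is more convenient than a flat list, the rows can be regrouped by number. The sketch below uses rows copied from the output above and assumes that the third column is a plural flag ("1" for plural, "0" for singular); that reading is inferred from the sample rows, not from the `helpers_csv` code, and the name `lookup` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Rows in the format shown above; the third column is assumed to be a
# plural flag ("1" = plural, "0" = singular).
rows = [
    ["AHS-Lehrer", "AHS-Lehrer*innen", "1"],
    ["AHS-Lehrer", "AHS-Lehrerin bzw. AHS-Lehrer", "0"],
    ["AHS-Lehrer", "AHS-Lehrerinnen und AHS-Lehrer", "1"],
]

# Group the gendered suggestions by number and by ungendered word.
lookup: Dict[str, Dict[str, List[str]]] = {
    "sg": defaultdict(list),
    "pl": defaultdict(list),
}
for ungendered, gendered, is_plural in rows:
    lookup["pl" if is_plural == "1" else "sg"][ungendered].append(gendered)

print(lookup["sg"]["AHS-Lehrer"])
print(lookup["pl"]["AHS-Lehrer"])
```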