Here we try to extract useful data from OpenThesaurus. We first compile an extensive list of all words that denote persons. In a second step, we try to retrieve good synonyms for them. The first step worked well, but the second did not: the synonyms are often unsuitable or even offensive, since OpenThesaurus contains a lot of colloquial language.

For future work, one could take the list of male person words and either come up with synonyms by hand or check the synonyms from OpenThesaurus very closely.

from os import path
from typing import Dict, List
import itertools
import pandas as pd
import re
import spacy
import subprocess
import sys

sys.path.insert(0, "..")
from helpers import add_to_dict, log
from helpers_csv import csvs_to_list, dict_to_csvs

We download a MySQL dump from OpenThesaurus and run the queries given in queries.sql against it. The results are saved in the query_results folder.

iterations = []
for i in range(0, 7):
    df = pd.read_csv(path.join("query_results", "{}_iterations.csv".format(i)))
    previous = set(itertools.chain(*iterations))
    values = set(df["word"].values.tolist()).difference(previous)
    # Keep single capitalized words only; the leading class needs the uppercase umlaut Ä (not ä).
    values = list(filter(lambda x: re.match(r"^[A-ZÄÖÜ][a-zäöüß\-]+$", x), values))
    iterations.append(values)

print(list(map(len, iterations)))
words = sorted(list(itertools.chain(*iterations)))
[1776, 1862, 577, 90, 0, 0, 0]
open("openthesaurus_persons.csv", "w").write("\n".join(words))
48635
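
To make the filter above concrete, the following sketch applies the same two steps to a few hand-picked strings: keep only words not seen in an earlier iteration, then keep only single capitalized words. The sample batches and the name `new_words_per_iteration` are made up for illustration, and the pattern assumes that `ÄÖÜ` are the intended capitalized umlauts.

```python
import re

# Single capitalized word: one uppercase letter, then lowercase letters or hyphens.
PERSON_WORD = re.compile(r"^[A-ZÄÖÜ][a-zäöüß\-]+$")


def new_words_per_iteration(batches):
    """Per batch, return the sorted words that pass the filter and were not seen before."""
    seen = set()
    result = []
    for batch in batches:
        fresh = set(batch) - seen
        result.append(sorted(w for w in fresh if PERSON_WORD.match(w)))
        seen.update(batch)
    return result


# Hypothetical sample data, not taken from the query results.
batches = [["Arzt", "Ärztin", "guter Freund"], ["Arzt", "Chef", "VIP"], ["Chef"]]
print(new_words_per_iteration(batches))
# [['Arzt', 'Ärztin'], ['Chef'], []]
```

Multi-word entries such as "guter Freund" and all-caps entries such as "VIP" are dropped, and a later batch that only repeats known words contributes nothing, which is why the recorded counts fall to 0.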

To determine the grammatical gender, we first try spaCy. The quality is not very high: many genders are misclassified.

nlp = spacy.load("de_core_news_sm")


def grammatical_gender(s: str) -> List[str]:
    # Returns the gender values spaCy assigns to the first token, e.g. ["Masc"], or [] if none.
    return nlp(s)[0].morph.get("Gender")


print(
    *list(
        map(
            lambda a: (a, grammatical_gender(a)),
            [
                "Baum",
                "Mädchen",
                "Fachkraft",
                "Manager",
                "Managerin",
                "Beamte",
                "Beamtinnen",
                "Leiter",  # does not recognize gender of the second meaning
                "Butter",  # recognized incorrectly as 'Masc'
                "Teller",  # not recognized
                "Kabbulmoffdi",  # not a word, but recognized as 'Masc'
            ],
        )
    ),
    sep="\n"
)
('Baum', ['Masc'])
('Mädchen', ['Neut'])
('Fachkraft', ['Fem'])
('Manager', ['Masc'])
('Managerin', ['Fem'])
('Beamte', ['Masc'])
('Beamtinnen', ['Fem'])
('Leiter', ['Masc'])
('Butter', ['Masc'])
('Teller', [])
('Kabbulmoffdi', ['Masc'])

The grammatical gender detection of the chosen model is not very good in general, but since it is trained on news texts, it is hopefully good enough for person words.

genders = {}
for word in words:
    for gender in grammatical_gender(word):
        add_to_dict(gender, [word], genders)
pd.DataFrame.from_dict(genders, orient="index").transpose().head(20)
    Masc               Neut                   Fem
0   Aas                Abkomme                Ablegat
1   Aasgeier           Abstinent              Ahne
2   Abdecker           Abzuschiebender        Ahnfrau
3   Abenteurer         Adelige                Ahnherrin
4   Abgeordneter       Adept                  Akrobat
5   Abgesandter        Adonis                 Almerin
6   Abgeschobener      Afghane                Alterchen
7   Abkömmling         Allesbesserwisser      Angie
8   Abnicker           Alter                  Anlerntätigkeit
9   Abschiebehäftling  Amtsleiter             Anthropophage
10  Absolutist         Anweiser               Arbeitskraft
11  Abteilungsleiter   Aufschneider           Arztsekretärin
12  Abtrünniger        Augenzeuge             Ass
13  Abundzubi          Auslandskorrespondent  Atze
14  Abweichler         Auspeitscher           Aufsicht
15  Abwickler          Azubi                  Aufwartefrau
16  Abzocker           Baby                   Augur
17  Adeliger           Bader                  Aushilfe
18  Adelsherrscher     Bandit                 Aushilfskraft
19  Adliger            Baronin                Autorität

Because gender detection with spaCy is not satisfactory, we try RFTagger instead. RFTagger can be downloaded for free, but we cannot redistribute it, so you will need to download it yourself from its project page.

def grammatical_gender_rft_batch(tokens: List[str]) -> Dict[str, List[str]]:
    # Tags one token per line with RFTagger and groups the tokens by grammatical gender.
    rftagger_path = "./rf-tagger/RFTagger"
    temp_file = "test/temp.txt"
    # Close the file before calling the tagger so that all tokens are flushed to disk.
    with open(path.join(rftagger_path, temp_file), "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))
    result = subprocess.run(
        ["src/rft-annotate", "lib/german.par", temp_file],
        cwd=rftagger_path,
        capture_output=True,
    )
    output = result.stdout.decode("UTF-8")
    dic = {}
    for line in output.split("\n"):
        # The word is everything before the last tab; lines without a word are skipped.
        matches = re.findall(r"^.*\t", line)
        if len(matches) > 0 and len(matches[0]) > 1:
            word = matches[0][:-1]
            # spacy_genders = grammatical_gender(word)
            rft_genders = re.findall(r"Masc|Fem|Neut", line)
            # Keep only lines that mention exactly one gender.
            if len(rft_genders) == 1:
                rft_gender = rft_genders[0]
                # spacy_gender = spacy_genders[0]
                # if rft_gender == spacy_gender:
                add_to_dict(rft_gender, [word], dic)
    return dic
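
The parsing step of the function above can be tried without installing RFTagger. The sketch below groups words by the single gender found in each line. The sample lines are hand-written in the tab-separated `word<TAB>tag` shape the function expects and are not real tagger output; the tag strings and the name `group_by_gender` are assumptions. Unlike the function above, it looks for the gender only in the tag column, so a word that happens to contain "Fem" or "Masc" cannot be miscounted.

```python
import re
from collections import defaultdict
from typing import Dict, List

GENDER = re.compile(r"Masc|Fem|Neut")


def group_by_gender(tagger_output: str) -> Dict[str, List[str]]:
    """Group words by gender, keeping only lines whose tag names exactly one gender."""
    groups = defaultdict(list)
    for line in tagger_output.splitlines():
        # Split at the last tab: word on the left, tag on the right.
        word, sep, tag = line.rpartition("\t")
        if not sep or not word:
            continue
        genders = GENDER.findall(tag)
        if len(genders) == 1:
            groups[genders[0]].append(word)
    return dict(groups)


# Illustrative input only; the tag format is assumed, not copied from a real run.
sample = "Baum\tN.Reg.Nom.Sg.Masc\nButter\tN.Reg.Nom.Sg.Fem\nund\tCONJ.Coord\n"
print(group_by_gender(sample))
# {'Masc': ['Baum'], 'Fem': ['Butter']}
```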


genders = grammatical_gender_rft_batch(words)
print(
    *list(
        map(
            lambda a: (a, list(grammatical_gender_rft_batch([a]).keys())),
            [
                "Baum",
                "Mädchen",
                "Fachkraft",
                "Manager",
                "Managerin",
                "Beamte",
                "Beamtinnen",
                "Leiter",  # does not recognize gender of the second meaning
                "Butter",
                "Teller",
                "Kabbulmoffdi",  # not a word, but recognized as 'Neut'
            ],
        )
    ),
    sep="\n"
)
('Baum', ['Masc'])
('Mädchen', ['Neut'])
('Fachkraft', ['Fem'])
('Manager', ['Masc'])
('Managerin', ['Fem'])
('Beamte', ['Masc'])
('Beamtinnen', ['Fem'])
('Leiter', ['Masc'])
('Butter', ['Fem'])
('Teller', ['Masc'])
('Kabbulmoffdi', ['Neut'])
pd.DataFrame.from_dict(genders, orient="index").transpose().head(20)
    Masc              Fem               Neut
0   Aasgeier          Abgesandter       Adoptivkind
1   Abdecker          Abkomme           Alter
2   Abenteurer        Adoptivtochter    Alterchen
3   Abgeordneter      Ahnfrau           Anerkennungsjahr
4   Abkömmling        Ahnherrin         Arschloch
5   Abnicker          Amtsperson        Assassine
6   Absolutist        Angetraute        Barbier
7   Abstinenzler      Angie             Berufsanerkennungsjahr
8   Abteilungsleiter  Ansprechperson    Berufspraktikum
9   Abtrünniger       Arbeitskraft      Betthupferl
10  Abundzubi         Aufsicht          Bienchen
11  Abwart            Aufwartefrau      Biest
12  Abweichler        Aushilfe          Bleichgesicht
13  Abwickler         Aushilfskraft     Blondchen
14  Abzocker          Autorität         Braunhemd
15  Abzuschiebender   Autoritätsperson  Bruderherz
16  Achsmacher        Babe              Bärchen
17  Adabei            Bader             Bürschchen
18  Adelige           Baronesse         Callgirl
19  Adeliger          Baronin           Dicke

This is still far from perfect, but better than the spaCy model. Future work could use deep-german, but I currently have Mac-specific trouble installing it.
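
The comparison can be made slightly more concrete by scoring both taggers on the unambiguous test words. The sketch below reuses the predictions printed above; the gold genders are filled in by hand, and the ambiguous or invented entries (Beamte, Leiter, Kabbulmoffdi) are left out. With only eight words this is an illustration, not an evaluation.

```python
# Hand-assigned gold genders for the unambiguous test words.
gold = {
    "Baum": "Masc", "Mädchen": "Neut", "Fachkraft": "Fem", "Manager": "Masc",
    "Managerin": "Fem", "Beamtinnen": "Fem", "Butter": "Fem", "Teller": "Masc",
}

# Predictions copied from the two printed outputs above.
spacy_pred = {
    "Baum": ["Masc"], "Mädchen": ["Neut"], "Fachkraft": ["Fem"], "Manager": ["Masc"],
    "Managerin": ["Fem"], "Beamtinnen": ["Fem"], "Butter": ["Masc"], "Teller": [],
}
rft_pred = {
    "Baum": ["Masc"], "Mädchen": ["Neut"], "Fachkraft": ["Fem"], "Manager": ["Masc"],
    "Managerin": ["Fem"], "Beamtinnen": ["Fem"], "Butter": ["Fem"], "Teller": ["Masc"],
}


def accuracy(pred, gold):
    """Share of gold words for which exactly the gold gender was predicted."""
    correct = sum(1 for word, gender in gold.items() if pred.get(word) == [gender])
    return correct / len(gold)


print(accuracy(spacy_pred, gold))  # 0.75
print(accuracy(rft_pred, gold))    # 1.0
```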

open("openthesaurus_persons_male_sg.csv", "w").write("\n".join(genders["Masc"]))
39970

Next, we use OpenThesaurus once more to retrieve synonyms for the male person words found above. We create a new table male_persons with the single column male_person and import openthesaurus_persons_male_sg.csv into it. Then we run the following query and save the result in query_results/synonyms.csv:

select mp.male_person, t2.word as synonym from 
male_persons mp
join term t1 on mp.male_person = t1.word
join term t2 on t1.synset_id = t2.synset_id;
df = pd.read_csv(path.join("query_results", "synonyms.csv"))
df.head()
   male_person   synonym
0  Urmensch      Mensch der Altsteinzeit
1  Urmensch      Urmensch
2  Auftraggeber  Auftraggeber
3  Auftraggeber  Kunde
4  Auftraggeber  Mandant
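
The self-join on `synset_id` in the query above can be tried on a toy table. The sketch below uses an in-memory SQLite database instead of the MySQL dump; the table and column names follow the query, while the rows and synset ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table term (word text, synset_id integer);
    create table male_persons (male_person text);
    -- Hypothetical rows: three words share synset 1, one word is alone in synset 2.
    insert into term values
        ('Auftraggeber', 1), ('Kunde', 1), ('Mandant', 1), ('Urmensch', 2);
    insert into male_persons values ('Kunde'), ('Urmensch');
    """
)
rows = conn.execute(
    """
    select mp.male_person, t2.word as synonym
    from male_persons mp
    join term t1 on mp.male_person = t1.word
    join term t2 on t1.synset_id = t2.synset_id
    order by mp.male_person, synonym
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Each word is also returned as its own synonym, which matches the rows shown above; adding `and t1.word <> t2.word` to the second join condition would drop those rows.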
synonyms = df.to_records()
synonyms[:10]
rec.array([(0, 'Urmensch', 'Mensch der Altsteinzeit'),
           (1, 'Urmensch', 'Urmensch'),
           (2, 'Auftraggeber', 'Auftraggeber'),
           (3, 'Auftraggeber', 'Kunde'), (4, 'Auftraggeber', 'Mandant'),
           (5, 'Auftraggeber', 'Adressat'), (6, 'Kunde', 'Auftraggeber'),
           (7, 'Kunde', 'Kunde'), (8, 'Kunde', 'Mandant'),
           (9, 'Kunde', 'Adressat')],
          dtype=[('index', '<i8'), ('male_person', 'O'), ('synonym', 'O')])
synonyms_by_gender = grammatical_gender_rft_batch([s for _, _, s in synonyms])
# Use a set for fast membership tests; .get avoids a KeyError if a gender is missing.
nonmale = set(synonyms_by_gender.get("Fem", [])) | set(synonyms_by_gender.get("Neut", []))
synonyms_nonmale = {}
for _, male, synonym in synonyms:
    if synonym in nonmale:
        add_to_dict(male, [synonym], synonyms_nonmale)

for a, b in list(synonyms_nonmale.items())[:20]:
    print(a, b)
Schnorrer ['Zecke']
Nassauer ['Zecke']
Bettler ['Zecke']
Schmarotzer ['Zecke']
Dorfmatratze ['Kurtisane', 'Dirne', 'Dorfmatratze', 'Prostituierte', 'Gunstgewerblerin', 'Freudenmädchen', 'Bordsteinschwalbe', 'Nutte', 'Strichmädchen', 'Straßenmädchen', 'Hure', 'Entspannungsdame', 'Professionelle', 'Kokotte', 'Callgirl', 'Liebesdame', 'Liebesmädchen', 'Straßendirne', 'Straßenprostituierte', 'betreibt das älteste Gewerbe der Welt', 'eine, die es für Geld macht', 'Hartgeldnutte', 'Liebesdienerin', 'Sexarbeiterin', 'Edelnutte', 'Frau für spezielle Dienstleistungen', 'Hetäre', 'Musche', 'Horizontale', 'Sexdienstleisterin', 'Schnepfe', 'Lustdirne', 'Lohndirne', 'käufliches Mädchen', 'Anbieterin für sexuelle Dienstleistungen', 'leichtes Mädchen', 'Flittchen', 'Sünderin', 'Flitscherl']
Bordsteinschwalbe ['Kurtisane', 'Dirne', 'Dorfmatratze', 'Prostituierte', 'Gunstgewerblerin', 'Freudenmädchen', 'Bordsteinschwalbe', 'Nutte', 'Strichmädchen', 'Straßenmädchen', 'Hure', 'Entspannungsdame', 'Professionelle', 'Kokotte', 'Callgirl', 'Liebesdame', 'Liebesmädchen', 'Straßendirne', 'Straßenprostituierte', 'betreibt das älteste Gewerbe der Welt', 'eine, die es für Geld macht', 'Hartgeldnutte', 'Liebesdienerin', 'Sexarbeiterin', 'Edelnutte', 'Frau für spezielle Dienstleistungen', 'Hetäre', 'Musche', 'Horizontale', 'Sexdienstleisterin', 'Schnepfe', 'Lustdirne', 'Lohndirne', 'käufliches Mädchen', 'Anbieterin für sexuelle Dienstleistungen']
Tölpel ['Trampel', 'Pappnase', 'Rindvieh', 'Kasperl', 'Trampeltier', 'Niete', 'Hohlfigur', 'taube Nuss']
Torfkopf ['Trampel', 'Niete', 'Hohlfigur', 'Pappnase', 'Rindvieh', 'taube Nuss']
Dummkopf ['Trampel', 'Niete', 'Hohlfigur', 'Pappnase', 'Rindvieh', 'taube Nuss', 'Kasperl']
Bulle ['päpstlicher Erlass', 'Enzyklika', 'Hünengestalt']
Athlet ['Sportskanone']
Sportler ['Sportskanone']
Sportsmann ['Sportskanone']
Lebensgefährte ['Ehehälfte', 'bessere Hälfte', 'Gespons', 'Ehegespons']
Gatte ['Ehehälfte', 'bessere Hälfte', 'Gespons', 'Ehegespons']
Lebenspartner ['Ehehälfte', 'bessere Hälfte']
Ehepartner ['Ehehälfte', 'bessere Hälfte', 'Gespons', 'Ehegespons']
Partner ['Ehehälfte', 'bessere Hälfte', 'Gespons', 'Ehegespons']
Göttergatte ['Ehehälfte', 'bessere Hälfte']
Tagedieb ['Hallodri', 'Faultier']
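
Since `add_to_dict` comes from the local `helpers` module, the grouping step above may be easier to follow in a self-contained form. The sketch below assumes the helper appends values to a list stored under a key. It uses a few illustrative pairs modeled on the output above, with genders assigned by hand; `group_nonmale_synonyms` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_nonmale_synonyms(
    pairs: Iterable[Tuple[str, str]], nonmale: Set[str]
) -> Dict[str, List[str]]:
    """Map each male person word to its synonyms that are tagged Fem or Neut."""
    grouped = defaultdict(list)
    for male, synonym in pairs:
        if synonym in nonmale:
            grouped[male].append(synonym)
    return dict(grouped)


# Hypothetical input: (male_person, synonym) pairs and hand-assigned non-masculine words.
pairs = [
    ("Athlet", "Sportler"),
    ("Athlet", "Sportskanone"),
    ("Gatte", "Ehehälfte"),
    ("Gatte", "Ehepartner"),
    ("Tagedieb", "Faultier"),
]
nonmale = {"Sportskanone", "Ehehälfte", "Faultier"}

print(group_nonmale_synonyms(pairs, nonmale))
# {'Athlet': ['Sportskanone'], 'Gatte': ['Ehehälfte'], 'Tagedieb': ['Faultier']}
```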