unicodedata.category(unichr)
Returns the category of that code point.
You can find descriptions of the categories at unicode.org, but the relevant ones are the L, N, P, Z, and possibly S groups:
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
...
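To get a feel for the codes, here is a quick sketch that looks up a few characters. The sample characters are my own picks, and it is written for Python 3, where every str is already Unicode:

```python
import unicodedata

# A few sample characters (chosen only for illustration)
samples = [
    'A',        # LATIN CAPITAL LETTER A
    'a',        # LATIN SMALL LETTER A
    '\u01c5',   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    '\u02b0',   # MODIFIER LETTER SMALL H
    '\u4e2d',   # a CJK ideograph
    '5',        # DIGIT FIVE
    '!',        # EXCLAMATION MARK
    ' ',        # SPACE
    '+',        # PLUS SIGN
]

# Map each character to its two-letter general category
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(repr(ch), categories[ch])
```

The first letter of each code is the group (L, N, P, Z, S, ...), so you can also filter on unicodedata.category(ch)[0] if you want a whole group at once.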
You might also want to normalize your string first so that combining diacritics merge with their base letters wherever a precomposed form exists:
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
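A small sketch of why this matters for category filtering (Python 3; the example string is mine): a decomposed accent is a separate code point in category Mn, so a letters-only filter would strip it unless NFC has merged it into the base letter first.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
decomposed = 'e\u0301'

# NFC composes the pair into the single code point U+00E9
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))            # 2 1
print(unicodedata.category('\u0301'))            # Mn (nonspacing mark)
print(unicodedata.category(composed))            # Ll
```

Combinations that have no precomposed form stay as base letter plus mark even after NFC, so you may want to allow 'Mn' as well if those matter to you.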
Putting all this together:
import unicodedata

file_bytes = ... # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo', # Letters
    'Nd', 'Nl', # Decimal digits and letter-like numbers (e.g. Roman numerals)
    'Po', 'Ps', 'Pe', 'Pi', 'Pf', # Punctuation (not dashes 'Pd' or connectors 'Pc')
    'Zs' # Space separators (not line breaks)
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8') # ready to be written to a file
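If you want to try the filter on a string without going through a file, the same steps can be wrapped in a function. This is only a sketch for Python 3; filter_text and ALLOWED_CATEGORIES are names I made up for the example, and the sample string is mine.

```python
import unicodedata

# Same category list as above (hypothetical constant name)
ALLOWED_CATEGORIES = frozenset([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',        # Letters
    'Nd', 'Nl',                          # Numbers
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',        # Punctuation
    'Zs',                                # Space separators
])


def filter_text(text, allowed=ALLOWED_CATEGORIES):
    """Normalize to NFC, then keep only characters in the allowed categories."""
    normalized = unicodedata.normalize('NFC', text)
    return ''.join(ch for ch in normalized
                   if unicodedata.category(ch) in allowed)


# 'e' + combining accent, an arrow symbol (Sm), digits, and a newline (Cc)
sample = 'Cafe\u0301 \u2192 42!\n'
result = filter_text(sample)
print(repr(result))
```

Here the arrow and the trailing newline are dropped while the accented letter survives as a single code point. Note that with this list hyphens ('Pd'), underscores ('Pc') and line breaks ('Cc') are also removed, so add those categories or handle them separately if you need to keep them.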