User talk:Jberkel
Catalan pronunciations
Hi, just a note to be careful when adding Catalan pronunciations. For example, you added a pronunciation of ê to esquetx, which is wrong (it should be é) and unlikely in any case, since ê generally only occurs with inheritances and some old borrowings, and esquetx is a recent borrowing from English. I have documented the sources of pronunciation in the documentation to {{ca-IPA}}; in particular, only trust the DCVB for Balearic pronunciations and don't trust cawikt at all. Benwing2 (talk) 02:34, 28 January 2024 (UTC)
- @Benwing2: Ok, I thought cawikt was fairly reliable. Btw, thanks for your great work on the Catalan corner! Jberkel 10:42, 28 January 2024 (UTC)
Statistics
Hi Jberkel, are you planning to do another update of the statistics? Your last one already dates from 1 July. Yes, I know it takes a lot of time and computing power, but I think we would all simply like to know again. :) Steinbach (talk) 17:18, 22 February 2024 (UTC)
- @Steinbach Hi, I'd like to do it regularly, but there are still data problems with the HTML dumps: phab:T305407. The last reasonably complete data is from last July. The WMF people are working on it, but somehow it is taking forever; I keep asking about it :( Jberkel 17:42, 22 February 2024 (UTC)
- @Steinbach Fresh stats are up… Jberkel 00:53, 5 June 2024 (UTC)
HTML Dump
Hi, I saw your posts complaining about the lack of HTML dumps as I had the same issue. I ended up creating my own HTML dump using the API to rapidly download millions of entries. I used the 20240220 XML dump as a base so that the two dumps would include exactly the same revisions. Note that the same wikitext can produce different HTML code at different points in time, so I can't guarantee that the page looks exactly as it did at the time of the XML dump.
- Pages included: non-redirects in namespaces 0 (main) and 118 (reconstruction)
- Number of lines: 7,952,575
- Time generated: February 20, 2024, 7:49:52 PM to February 22, 2024, 1:16:18 AM (EST)
- Uncompressed size: 112,213,194,308 bytes
- Compressed size: 5,482,140,342 bytes
Would you be interested in the code or the dump itself?
Ioaxxere (talk) 20:05, 22 February 2024 (UTC)
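As a rough cross-check of the figures in the post above, the following sketch derives the elapsed time, the average download rate, the compression ratio and the average size per page. All constants are copied from the post; the function name `dump_stats` is made up for illustration, and the calculation assumes each line in the dump corresponds to one fetched page.

```python
from datetime import datetime

# Figures copied from the post above.
LINES = 7_952_575
UNCOMPRESSED_BYTES = 112_213_194_308
COMPRESSED_BYTES = 5_482_140_342
START = datetime(2024, 2, 20, 19, 49, 52)  # 7:49:52 PM (EST)
END = datetime(2024, 2, 22, 1, 16, 18)     # 1:16:18 AM (EST)


def dump_stats():
    """Return (elapsed_seconds, pages_per_second, compression_ratio, bytes_per_page)."""
    elapsed = (END - START).total_seconds()
    return (
        elapsed,
        LINES / elapsed,
        UNCOMPRESSED_BYTES / COMPRESSED_BYTES,
        UNCOMPRESSED_BYTES / LINES,
    )


if __name__ == "__main__":
    elapsed, rate, ratio, per_page = dump_stats()
    print(f"elapsed: {elapsed / 3600:.1f} h")
    print(f"rate: {rate:.1f} pages/s")
    print(f"compression: {ratio:.1f}x")
    print(f"average page: {per_page / 1024:.1f} KiB")
```

Under these assumptions the run took about 29.4 hours at roughly 75 pages per second, which sits just under the RATE_LIMIT of 80 per second used in the script posted later in this thread, and the dump compresses by a factor of about 20.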
- @Ioaxxere Lol, I'm close to starting a project myself, given the glacial progress on the WMF side. Yes, I'm interested, how did you get the HTML, how long does it take? Is it the Parsoid rendered version which is used in the HTML dumps? If you want we can join forces and run it as a community project. Jberkel 09:44, 23 February 2024 (UTC)
The script works by grabbing HTML data using a revision ID. For example: https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json. I'm not sure what parser is used but it seems to correspond with "view page source" in my browser. Here is the code:
import requests
import concurrent.futures
from time import time, sleep
from random import random
import mmap
import re

BATCH_SIZE = 10000
HEADER = {"User-Agent": "User:Ioaxxere"}  # replace with your username

# tuned parameters
RATE_LIMIT = 80  # per second
THREAD_COUNT = 100


def fetch_data(revid):
    """Fetch the parsed HTML (as a JSON string) for one revision, retrying until it succeeds."""
    print(revid)
    while True:
        starttime = time()
        try:
            result = requests.get(f"https://en.wiktionary.org/w/api.php?action=parse&oldid={revid}&format=json", headers=HEADER)
            if result.status_code == 200:  # OK
                break
            print("...error:", result.status_code)
        except requests.RequestException:
            print("...error: Connection failed")
        # back off for 0.5 to 1 second before retrying
        sleep(0.5 * (1 + random()))
    # each thread spends at least THREAD_COUNT/RATE_LIMIT seconds per request,
    # which keeps the combined rate of all threads at or below RATE_LIMIT
    waittime = THREAD_COUNT/RATE_LIMIT - (time() - starttime)
    if waittime > 0:
        sleep(waittime)
    return result.text


def big_file_finditer(filename, pattern, flags=0):
    """Run a regex over a large file through a memory map instead of reading it into memory."""
    # the pattern is compiled as bytes because the memory map exposes bytes
    compiled_pattern = re.compile(pattern.encode(), flags)
    with open(filename, "rb") as f:
        return compiled_pattern.finditer(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))


# latest revision ID of every page in namespaces 0 (main) and 118 (reconstruction)
pages = [revid.group(1).decode("utf-8") for revid in big_file_finditer("wikt_dump.xml", r"<ns>(?:0|118)</ns>\s+<id>\d+</id>\s+<revision>\s+<id>(\d+)", re.DOTALL)]

for i in range(0, len(pages), BATCH_SIZE):
    queries = pages[i:i+BATCH_SIZE]
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_COUNT) as executor:
        output = executor.map(fetch_data, queries)
        output = "\n".join(q for q in output) + "\n"
    # replace with your output location
    with open(r"D:\wiktionarydumps\output.ndjson", "a", encoding="utf-8") as out:
        out.write(output)
Then I verified the output with this code:
import re

n = 0
with open(r"D:\wiktionarydumps\output.ndjson", "r", encoding="utf-8") as f:
    for line in f:
        n += 1
        # an API failure is stored as a JSON object starting with an "error" key
        if line.startswith("{\"error\":"):
            print(n, re.findall("\"code\":\"([^\"]+)\"", line)[0])
Which produced:
43725 nosuchrevid
82006 nosuchrevid
106857 nosuchrevid
248730 nosuchrevid
319048 nosuchrevid
323556 nosuchrevid
330049 nosuchrevid
394498 nosuchrevid
437859 nosuchrevid
448121 nosuchrevid
561668 nosuchrevid
590865 nosuchrevid
603650 nosuchrevid
610405 nosuchrevid
720072 nosuchrevid
749333 nosuchrevid
808355 nosuchrevid
814281 nosuchrevid
822969 nosuchrevid
859557 nosuchrevid
1021390 nosuchrevid
1036457 nosuchrevid
1058296 nosuchrevid
1084837 nosuchrevid
1157698 nosuchrevid
1229978 nosuchrevid
1248685 nosuchrevid
1285246 nosuchrevid
1323983 nosuchrevid
1324915 nosuchrevid
1385186 nosuchrevid
1396962 nosuchrevid
1486775 nosuchrevid
1497989 nosuchrevid
1513303 nosuchrevid
1581275 nosuchrevid
1609470 nosuchrevid
1678410 nosuchrevid
1725167 nosuchrevid
1735366 nosuchrevid
1735744 nosuchrevid
1814983 nosuchrevid
1854120 nosuchrevid
1907407 nosuchrevid
1921876 nosuchrevid
1963831 nosuchrevid
2010212 nosuchrevid
2073363 nosuchrevid
2166069 nosuchrevid
2177988 nosuchrevid
2183914 nosuchrevid
2184460 nosuchrevid
2278457 nosuchrevid
2330349 nosuchrevid
2358375 nosuchrevid
2499758 nosuchrevid
2501157 nosuchrevid
2520901 nosuchrevid
2591419 nosuchrevid
2621251 nosuchrevid
2630284 nosuchrevid
2671770 nosuchrevid
2696918 nosuchrevid
2697777 nosuchrevid
2746586 nosuchrevid
2769872 nosuchrevid
2831640 nosuchrevid
2857869 nosuchrevid
2910282 nosuchrevid
2911183 nosuchrevid
2915318 nosuchrevid
2967304 nosuchrevid
3014563 nosuchrevid
3063851 nosuchrevid
3124420 nosuchrevid
3137890 nosuchrevid
3185708 nosuchrevid
3225411 nosuchrevid
3230226 nosuchrevid
3241060 nosuchrevid
3259739 nosuchrevid
3261952 nosuchrevid
3301323 nosuchrevid
3318285 nosuchrevid
3320219 nosuchrevid
3324414 nosuchrevid
3336037 nosuchrevid
3443783 nosuchrevid
3481014 nosuchrevid
3527574 nosuchrevid
3585227 nosuchrevid
3589765 nosuchrevid
3614305 nosuchrevid
3734605 nosuchrevid
3821927 nosuchrevid
3843626 nosuchrevid
3914931 nosuchrevid
3925139 nosuchrevid
4025930 nosuchrevid
4244319 nosuchrevid
4246017 nosuchrevid
4260112 nosuchrevid
4278061 nosuchrevid
4330469 nosuchrevid
4331657 nosuchrevid
4412350 nosuchrevid
4413758 nosuchrevid
4432652 nosuchrevid
4485019 nosuchrevid
4602733 nosuchrevid
4608289 nosuchrevid
4720573 nosuchrevid
4737790 nosuchrevid
4858538 nosuchrevid
4889458 nosuchrevid
4908594 nosuchrevid
4973122 nosuchrevid
5010716 nosuchrevid
5052814 nosuchrevid
5150511 nosuchrevid
5154623 nosuchrevid
5182578 nosuchrevid
5223840 nosuchrevid
5235533 nosuchrevid
5246229 nosuchrevid
5259002 nosuchrevid
5344233 nosuchrevid
5364980 nosuchrevid
5368363 nosuchrevid
5369738 nosuchrevid
5469778 nosuchrevid
5507943 nosuchrevid
5598277 nosuchrevid
5607802 nosuchrevid
5631256 nosuchrevid
5648406 nosuchrevid
5659237 nosuchrevid
5729700 nosuchrevid
5752778 nosuchrevid
5774071 nosuchrevid
5790022 nosuchrevid
5833505 nosuchrevid
5861520 nosuchrevid
5864017 nosuchrevid
5871030 nosuchrevid
5877754 nosuchrevid
5983008 nosuchrevid
6006358 nosuchrevid
6067067 nosuchrevid
6085428 nosuchrevid
6138076 nosuchrevid
6138136 nosuchrevid
6188278 nosuchrevid
6248831 nosuchrevid
6276367 nosuchrevid
6286098 nosuchrevid
6289698 nosuchrevid
6293458 nosuchrevid
6303351 nosuchrevid
6309621 nosuchrevid
6311475 nosuchrevid
6391744 nosuchrevid
6392577 nosuchrevid
6396159 nosuchrevid
6409595 nosuchrevid
6412793 nosuchrevid
6424036 nosuchrevid
6484785 nosuchrevid
6562806 nosuchrevid
6568126 nosuchrevid
6580802 nosuchrevid
6633849 nosuchrevid
6741033 nosuchrevid
6797937 nosuchrevid
6900647 nosuchrevid
6903671 nosuchrevid
6996408 nosuchrevid
6996487 nosuchrevid
7030860 nosuchrevid
7043778 nosuchrevid
7048043 nosuchrevid
7059900 nosuchrevid
7091062 nosuchrevid
7091425 nosuchrevid
7130255 nosuchrevid
7169063 nosuchrevid
7184906 nosuchrevid
7244549 nosuchrevid
7276644 nosuchrevid
7331248 nosuchrevid
7359021 nosuchrevid
7537357 nosuchrevid
7578135 nosuchrevid
7585843 nosuchrevid
7595812 nosuchrevid
7641806 nosuchrevid
7651915 nosuchrevid
7697219 nosuchrevid
7778037 nosuchrevid
7781476 nosuchrevid
7782612 nosuchrevid
7802193 nosuchrevid
7808302 nosuchrevid
7820909 nosuchrevid
7885180 nosuchrevid
7914802 nosuchrevid
These correspond with pages in the XML dump that have recently been deleted.
I don't have the time/resources to generate these on a regular basis, but you're welcome to adapt this code for your purposes!
Ioaxxere (talk) 19:56, 23 February 2024 (UTC)
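The same check can also be sketched with the `json` module instead of a regular expression, which additionally catches lines that are not valid JSON at all. This is only an illustration, not the code that produced the output above: the function name `tally_errors` and the `invalid-json` label are made up here, and the sample records are invented to show the shape of the data.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Return (Counter of error codes, 1-based line numbers of failed records)."""
    codes = Counter()
    bad_lines = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # "invalid-json" is a label invented for this sketch
            codes["invalid-json"] += 1
            bad_lines.append(n)
            continue
        if isinstance(record, dict) and isinstance(record.get("error"), dict):
            codes[record["error"].get("code", "unknown")] += 1
            bad_lines.append(n)
    return codes, bad_lines


if __name__ == "__main__":
    # invented sample records, not real API output
    sample = [
        '{"parse":{"title":"cat","revid":1}}',
        '{"error":{"code":"nosuchrevid","info":"no such revision"}}',
        "not json",
    ]
    codes, bad = tally_errors(sample)
    print(dict(codes), bad)
```

Passing an open file object instead of the sample list would tally the whole NDJSON file line by line without loading it into memory.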
- Oh god, I just realized that adding &parsoid=true to the API query gives *far* better data. Time to rerun... Ioaxxere (talk) 20:09, 23 February 2024 (UTC)
- Cool, thanks! We could run it on WMF infrastructure. Great to see that 50 lines of Python yield better results than the WMF's buzzword soup of Kafka, DAGs and what have you… How long does it take to do a full run? Jberkel 15:20, 26 February 2024 (UTC)
- nm, you already had in your post, almost 2 days… :) Jberkel 15:57, 26 February 2024 (UTC)
- Even if the WMF some day manage to produce useful dumps again, we'll still need wiki-specific namespaces such as Reconstruction, so it'll be useful to have some way of generating them ourselves. Jberkel 15:58, 26 February 2024 (UTC)
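For reference, the request URL with the extra parameter mentioned in this thread can be built as in the sketch below. The endpoint and the `action`, `oldid`, `format` and `parsoid` parameters are taken from the posts above; the helper name `parse_url` is hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"


def parse_url(revid, parsoid=True):
    """Build an action=parse URL for one revision ID, optionally asking for Parsoid HTML."""
    params = {"action": "parse", "oldid": revid, "format": "json"}
    if parsoid:
        params["parsoid"] = "true"
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(parse_url(65853771))
    print(parse_url(65853771, parsoid=False))
```

With `parsoid=False` the helper reproduces the example URL from the original post, so switching renderers only changes one query parameter.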
ScribuntoUnit vs. UnitTests
I just discovered there are two unit testing frameworks here, Module:UnitTests used by everyone but you, and Module:ScribuntoUnit used by you. The former is older than the latter, so I'm not sure why you imported the latter from Wikipedia, but I think we should consolidate. Can you think about converting your unit tests to use Module:UnitTests? Benwing2 (talk) 20:34, 10 March 2024 (UTC)
- Hi, just wondering if you got my msg. Can you at least clarify why you imported and started using Module:ScribuntoUnit in preference to our own module? BTW I just discovered a third unit test framework, Module:QFQ/UnitTests, used only on Module:mnw-translit. Benwing2 (talk) 07:43, 14 March 2024 (UTC)
- Hi @Benwing2, sorry had short Wiktionary hiatus. It's been a long time (~ 10 years), but I think when I first looked at Module:UnitTests it was a spaghetti mess and didn't have the features I wanted. That's probably no longer the case, and I agree it's better to standardize on one framework. Jberkel 09:27, 15 March 2024 (UTC)
Wwoww, Jberkel, you're fast. Wanted to cite the same Guardian passage here, and it was already there ... MistaPPPP (talk) 12:55, 19 March 2024 (UTC)
Apologies
I need to apologise to you also: my simple edit to my archaic paragraph about certain 'etymologies that discredit Wiktionary' completely disrupted the edit section, including yours. There should really be a mechanism in place to stop this from happening, since any innocent editor could well make a similar mistake that, if not detected as quickly as both Surjection and I did, could cause linguistic mayhem! Regards, Andrew Andrew H. Gray 11:40, 29 March 2024 (UTC)
On ass...
What Doyle said was about this:
https://en.m.wiktionary.org/wiki/arse#English
Here, ass is another way of spelling arse (as in dumb). Lunatone3000 (talk) 22:24, 4 April 2024 (UTC)
The reputation system
You mentioned this in a beer parlour comment about "the reputation system, for good or ill".
The reputation system is for ill.
There are editors like me whose behavior is scrutinized. And people are willing to make inaccurate claims about how many or few productive edits I've
Then there are other editors who have almost no ability at all to get along with other editors or admit wrongdoing. But, because they're perceived as being essential to the project, it's unacceptable to question their opinions or behavior. Purplebackpack89 13:46, 5 June 2024 (UTC)
- I'd say there's a mix of different people finding problems with your edits: editors who had already mentally "blacklisted" you (Equinox, putting you in the "moron" box), WF (creating RFDs "for the lulz" to create havoc), and more level-headed/diplomatic editors who see real CFI/process-related issues. As -sche pointed out, because there are so many different editors involved, it's difficult to conclude that *all* of them are here to harass you. And because this has been going on for years, patience/good will/faith is running low… Jberkel 14:59, 5 June 2024 (UTC)
- "Because there are so many different editors involved" makes it feel like I'm being harassed regardless of why they are doing it. Perhaps unwittingly, Equinox name-calling and WF/Denazz trolling made it harder for somebody like Benwing to legit address my edits. Knightwho is somewhere in between. While he may also legit want to clean up the project, he has a long and well-documented history of being confrontational. And the other problem is that Benwing and Knight could've maybe noticed that I felt put upon at the moment and maybe waited, say, a couple of weeks until things had died down. There wasn't anything they were doing that had to be addressed immediately. They didn't do that. Purplebackpack89 16:37, 5 June 2024 (UTC)
Wanted
User:Jberkel/lists/wanted hasn't bin updated4a while. Can we get it bac, pls? Denazz (talk) 22:28, 5 June 2024 (UTC)
- now iz bac. zorry for ze inconviniance caused. Jberkel 09:24, 6 June 2024 (UTC)
List user subpages
Many of the various long lists on user subpages of yours seem to have served their purpose and/or to no longer be in active use. Also, the same term often appears on multiple subpages, differing only by when they were compiled. The result of this is that using "&sort=incoming_links_desc" in the searchbox to find entries relatively important to other Wiktionary entries does not give a good list. My user pages have had the same effect. I have consequently used <nowiki> to disable entire subpages. If you are too busy, let me know which pages are important (or what rule to follow to determine importance) so I could disable the right pages, if there are any. You are not the only one with such subpages, but yours are the ones I most notice. DCDuring (talk) 22:23, 17 July 2024 (UTC)
- Are you referring to the wanted entries lists? Yes, they should probably be deleted, but I haven't had the time to submit them all for deletion (needs to be automated, there are so many of them). Maybe some admin with scripting skills can delete them directly? Jberkel 22:26, 17 July 2024 (UTC)
- Yes, them's the ones. Did you want to extract any of the redlinks in any of them? DCDuring (talk) 23:00, 17 July 2024 (UTC)
I believe, based on a simple test on made-up subpages of mine, that you can delete all the subpages of a top subpage at once by deleting the top subpage. That wouldn't take long. I don't think adminship is required. I was wrong. It seems to be as you said. DCDuring (talk) 23:09, 17 July 2024 (UTC)
- @Benwing2 Could you please mass-delete the old wanted entries lists (and dependent data modules)? Perhaps everything before 2024. Jberkel 07:07, 18 July 2024 (UTC)
- @Jberkel Can you supply me with a list (at least in schematic form, it doesn't have to include every single file)? Benwing2 (talk) 07:32, 18 July 2024 (UTC)
- @Benwing2: every list has two pages:
- User:Jberkel/lists/wanted/YYYYMMDD/[lang-code]
- User:Jberkel/lists/wanted/YYYYMMDD/[lang-code]/data
- Language codes are in User:Jberkel/lists/wanted/languages.
- Timestamps to delete:
- 20230701, 20230601, 20230301, 20230201, 20230101, 20221001, 20220820, 20220601, 20220501, 20220401, 20220320, 20220301, 20220120, 20220101, 20211201, 20211101, 20211001, 20210901, 20210801, 20210701, 20210601, 20210501, 20210401, 20210101, 20201101, 20200401, 20200201, 20200120, 20200101, 20191201, 20191101, 20191020, 20191001, 20190901, 20190801, 20190701, 20190620, 20190601, 20190501, 20190420, 20190401.
- + for each timestamp, the overview page: User:Jberkel/lists/wanted/YYYYMMDD
- – Jberkel 13:33, 18 July 2024 (UTC)
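The page scheme described in this thread (one overview page per timestamp, plus a list page and a data page per language code) can be expanded into concrete titles with a sketch like the following. The base path and the three title patterns come from the thread; the function name `pages_to_delete` is hypothetical, and the two language codes in the demo are placeholders, since the real codes live in User:Jberkel/lists/wanted/languages.

```python
BASE = "User:Jberkel/lists/wanted"


def pages_to_delete(timestamps, lang_codes):
    """Expand the schematic list into page titles, in the order they are described."""
    titles = []
    for ts in timestamps:
        titles.append(f"{BASE}/{ts}")  # overview page for this timestamp
        for code in lang_codes:
            titles.append(f"{BASE}/{ts}/{code}")       # the list itself
            titles.append(f"{BASE}/{ts}/{code}/data")  # its data page
    return titles


if __name__ == "__main__":
    # placeholder language codes, not the real list
    for title in pages_to_delete(["20230701", "20230601"], ["ca", "cy"]):
        print(title)
```

Each timestamp contributes one overview page plus two pages per language code, so the number of titles grows as timestamps × (1 + 2 × languages).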
Hello, may I ask you why did you revert me here? Regards, RodRabelo7 (talk) 20:15, 22 July 2024 (UTC)
- Why did you replace … with ... in the first place? One is an ellipsis, the other is three dots, it's not the same thing. Jberkel 20:34, 22 July 2024 (UTC)
- Most projects use three periods instead of the ellipsis (w:WP:...), as older browsers and systems may not support it properly. I thought it to be convincing, though I confess I'm not sure if there's a policy regarding that here on Wiktionary. Best regards, RodRabelo7 (talk) 20:40, 23 July 2024 (UTC)
- In general, Wiktionary doesn't follow Wikipedia's style guidelines. The ellipsis template has been around for a long time (2008), if you want to change it to dots please start a discussion somewhere first. There's no clear policy on this afaik. Jberkel 21:20, 23 July 2024 (UTC)
A reminder that the Actualités du Wiktionnaire are still being published, but our announcement system was out of service. We apologise for the inconvenience.
A new issue of the Actualités du Wiktionnaire has just come out!
In this well-stocked summer edition: a press review and a list of videos to improve your sticky siestas, plus three articles: a dictionary of co-occurrences presented by Trace, a discussion by Noé based on an article about the words most often looked up in dictionaries, and an explanation of enclisis by Àncilu. All wrapped in topical illustrations.
Read issue 112, July 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 19:59, 14 August 2024 (UTC)
ngram dataset v3
User:Jberkel/lists/Frequency links to v2
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html was released 3 years after your last generation, perhaps you might be interested in updating? Akaibu (talk) 03:53, 22 August 2024 (UTC)
Please don't leave etymologies like that, @Trooper57 maybe you can help? Stríðsdrengur (talk) 14:23, 28 August 2024 (UTC)
- Why? It’s a wiki, a work in progress. It will get completed eventually. Patience :) Jberkel 21:47, 28 August 2024 (UTC)
- I mean, just putting "Tupian" doesn't help much Stríðsdrengur (talk) 17:30, 29 August 2024 (UTC)
- It’s not wrong is it? Someone with more knowledge can make it more precise. That’s why I added the rfe afterwards. Jberkel 18:56, 30 August 2024 (UTC)
- You still don't understand, just putting "tupian" doesn't help at all, you could try to make an effort and put at least something basic like "derived from a tupian language" Stríðsdrengur (talk) 19:27, 30 August 2024 (UTC)
- Sorry, again, it's a wiki, yes, a full sentence would be splendid, but users are surely able to figure out what "Tupian" means if they follow the link. Perhaps most of the Brazilian Portuguese borrowings are really from Old Tupi, but they could also be recent borrowings from a contemporary Tupian language. Not sure. One last thing, I'd like to echo WF's comments on your talk page, who described your communication as "patronizing" and "finger-pointing". It's not how you make friends here… Jberkel 00:07, 31 August 2024 (UTC)
This summer issue is packed with news and briefs! The dictionary of the month is presented by Trace and deals with expressions, while Noé discusses the Wiktionnaire's heritage and innovation. The illustrations come from the collection of a design museum!
Read issue 113, August 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 13:36, 1 September 2024 (UTC)
An issue featuring slang and the regional languages of France! In addition to the usual briefs, statistics and press review, there are two articles by Lyokoï and Noé, surrounded by illustrations of brick architecture!
Read issue 114, September 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 10:50, 1 October 2024 (UTC)
An issue under the auspices of Greek Antiquity! Besides the traditional monthly press review, project news and statistics, there is an article on the evolution of artificial intelligence by Romainbehar and a presentation of the history of slang dictionaries by Lyokoï!
Read issue 115, October 2024!
Draft of the next issue | Past issues | Subscribe/unsubscribe
Cantons-de-l'Est (talk) 10:21, 1 November 2024 (UTC)
Hi, thanks for making these "wanted terms" lists, they are really useful!
Is there any chance I could ask for some parameters to be tweaked? For example, I think this list (and presumably equivalent lists for other languages) would greatly benefit from having Wiktionary:Requested entries (Welsh), Appendix:Celtic word lists and Appendix:Word lists of languages of Europe able to "contribute" - which as far as I'm aware they currently don't?
Btw, if you could also create an equivalent list for Middle Welsh (wlm), I'd be a very happy editor.
Cheers Arafsymudwr (talk) 17:30, 2 November 2024 (UTC)