Wikipedia:Bots/Requests for approval/ST47ProxyBot

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Approved.

ST47ProxyBot

Operator: ST47

Time filed: 07:33, Sunday, December 1, 2019 (UTC)

Function overview: Block IP addresses belonging to open proxies, public VPN services, and web hosts.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Combination of Python and Perl

Source code available: No, to protect sources and testing methods. Mostly uses pywikibot to interact with the wiki.

Links to relevant discussions (where appropriate): WP:NOP authorizes open proxies to be blocked; Wikipedia:Bots/Requests for approval/ProcseeBot demonstrates that a bot may be used for this task.

Edit period(s): Continuously

Estimated number of pages affected: Initially a higher rate due to a large number of not-currently-blocked proxies; once it reaches steady state, probably about 100 logged actions per day.

Namespace(s): None

Exclusion compliant (Yes/No): No, not applicable

Adminbot (Yes/No): Yes

Function details: WP:NOP states that open proxies may be blocked at any time. Open proxies allow editors to evade blocks, avoid detection, or appear to be multiple users when they are actually only one person. Waiting until these proxies are abused is not practical, as multiple easily-searchable websites advertise thousands of such proxies. The number of blocks to be made is high enough that automation is required.

There is already a bot approved for this task. However, it is beneficial to have multiple independent tools. There are a large number of data sources out there, and I've already been able to find a number of proxies that I can access, but which haven't been blocked. I'm sure others can find things that I can't. Since this involves trying to test the proxies, operators in different geographical areas or homed to different ISPs may have different views of the internet: one ISP may "blackhole" a network that hosts a lot of malicious proxies while another ISP may not, or one bot might have an outage of some sort while another is still running normally.

This bot makes three types of blocks:

Open (HTTP/HTTPS/SOCKS) proxies, tested by accessing Wikipedia through the proxy
Open VPN servers, verified by at least two data sources
IP ranges used for web hosting, cloud services, etc, manually reviewed and curated

In the first two cases, it uses open sources to build a list of IP addresses, which it scans through. For proxy types that can be automatically tested, the bot attempts to access the Userinfo API. VPN testing cannot easily be automated, so instead the bot performs checks against several independent sources to determine if the IP address is a VPN node or not. The original proxy list is the first source, and the bot requires positive responses from at least two more sources before blocking. Once the bot has either successfully accessed API:Userinfo through the proxy or verified the proxy against enough independent data sources, it adds the IP to a list to be blocked.
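A minimal sketch of the decision rule described above, in Python. The bot's source is not published, so the `Candidate` class, its field names, and `should_block` are hypothetical; only the two thresholds (a successful API:Userinfo request for testable proxies, at least two confirming sources beyond the original list for VPNs) come from the description.

```python
from dataclasses import dataclass

# From the description: the original proxy list is the first source, and
# at least two more sources must confirm a VPN node before it is blocked.
MIN_EXTRA_VPN_SOURCES = 2


@dataclass
class Candidate:
    """Hypothetical record for one IP taken from a proxy list."""
    ip: str
    kind: str                       # "proxy" (testable) or "vpn" (not testable)
    reached_userinfo: bool = False  # True if API:Userinfo answered through the proxy
    confirming_sources: int = 0     # independent sources beyond the original list


def should_block(c: Candidate) -> bool:
    """Return True once the candidate meets the evidence bar for its type."""
    if c.kind == "proxy":
        return c.reached_userinfo
    if c.kind == "vpn":
        return c.confirming_sources >= MIN_EXTRA_VPN_SOURCES
    return False  # unknown types are never queued for blocking
```

For example, `should_block(Candidate("192.0.2.2", "vpn", confirming_sources=1))` is false, because only one source beyond the original list has confirmed the node.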

The block duration varies based on the number of previous proxy blocks of that IP address. Currently, the first block is set for 14 days, ramping up to 2 years after enough blocks. The block disables account creation and leaves anon-only unselected; i.e., it is a hardblock. The block message uses {{blocked proxy}} with a comment providing enough information for me to investigate why the IP was blocked in case there is any question. If the IP is already blocked, including by a local rangeblock, the bot skips it. The bot does not check global blocks. If the proxy is an IPv6 IP, the bot blocks the /64 range.
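The escalation and the IPv6 handling can be sketched as follows. This is an illustration, not the bot's code: the function names are hypothetical, the intermediate steps (2 months after one earlier block, 1 year after two) are read off the trial results further down this page, and treating three or more earlier blocks as the 2-year ceiling is an assumption, since the request only says "after enough blocks".

```python
import ipaddress
from datetime import timedelta

# Index = number of previous proxy blocks. 14 days is stated in the request;
# the 2-month and 1-year steps are approximations of the trial results.
ESCALATION = [timedelta(days=14), timedelta(days=60), timedelta(days=365)]
MAX_DURATION = timedelta(days=730)  # "2 years after enough blocks"; threshold assumed


def block_duration(previous_proxy_blocks: int) -> timedelta:
    """Pick a block length from the count of earlier proxy blocks."""
    if previous_proxy_blocks < 0:
        raise ValueError("block count cannot be negative")
    if previous_proxy_blocks < len(ESCALATION):
        return ESCALATION[previous_proxy_blocks]
    return MAX_DURATION


def block_target(ip: str) -> str:
    """Return the address to block: the /64 for IPv6, the single IP for IPv4."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        # strict=False masks the host bits instead of raising an error.
        return str(ipaddress.ip_network(f"{ip}/64", strict=False))
    return str(addr)
```

With these definitions, `block_target("2001:db8::abcd")` gives `2001:db8::/64`, while an IPv4 address is returned unchanged.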

In the case of web hosting IP ranges, these are ranges that I identify through whois data and other sources as belonging to web hosting or other similar types of companies. I generally find an address that is being abused, usually by a spambot, and then investigate and block the entire address space assigned to web hosting at that company. In this case, I manually review the list of ranges to be blocked, and the bot blocks them for a fixed time period, generally using the {{colocationwebhost}} template, and a description of the IP ownership in the block comment. This task will never be fully automated, as it is based on me finding and deciding to block a given set of IP ranges. However, I'm bundling it with this bot request because it also entails making a large number of blocks at once.

Further, this bot may modify blocks that it initially issued, either extending them if they are near expiration and the proxy is still active, or removing them if the proxy is confirmed to be inactive. (The removal would only be after several checks in a row, over a period of time, confirm that the proxy isn't active, and this isn't currently implemented.) ST47 (talk) 07:33, 1 December 2019 (UTC)[reply]
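A rough sketch of how the extend-or-remove rule could look. The request says removal is not currently implemented, so this is purely illustrative: the function name, the 7-day renewal window, and the three consecutive inactive checks are all assumptions rather than values from the request.

```python
def maintenance_action(days_until_expiry, recent_checks_active,
                       renew_window_days=7, required_inactive_checks=3):
    """Decide what to do with a block this bot issued earlier.

    recent_checks_active lists check results from oldest to newest,
    where True means the proxy still responded.
    Returns "extend", "unblock", or "keep".
    """
    if not recent_checks_active:
        return "keep"  # no evidence either way
    tail = recent_checks_active[-required_inactive_checks:]
    if len(tail) == required_inactive_checks and not any(tail):
        return "unblock"  # several checks in a row found the proxy inactive
    if recent_checks_active[-1] and days_until_expiry <= renew_window_days:
        return "extend"  # still active and close to expiring
    return "keep"
```

Requiring a full run of inactive checks before unblocking reflects the stated intent that a single failed probe should not lift a block.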

Discussion

To which degree does this task overlap with that of User:ProcseeBot? Also, summoning their botop slakr here. Jo-Jo Eumerus (talk) 11:35, 1 December 2019 (UTC)[reply]

@Jo-Jo Eumerus: Same objective, but I am finding a large number of proxies that are not currently blocked, and that need to be. Possibly because I'm using different data sources or different testing methods. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]

What mechanism do you have in place to prevent wheel warring, especially with human admins? — xaosflux ^Talk 14:17, 1 December 2019 (UTC)[reply]

@Xaosflux: I will add a check that it will not block any IP address that has ever been unblocked by a human admin. Instead it will log those to a file or to userspace. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]
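The check described here could be sketched like this. The entry shape (dicts with `action` and `user` keys) is loosely modelled on block-log entries, and the function names and the set of bot accounts are assumptions made for the example.

```python
# Accounts whose unblocks do not count as a human admin decision (assumed set).
BOT_ACCOUNTS = {"ST47ProxyBot", "ProcseeBot"}


def human_unblocked_before(log_entries, bot_accounts=BOT_ACCOUNTS):
    """True if any block-log entry is an unblock performed by a non-bot account."""
    return any(
        entry.get("action") == "unblock" and entry.get("user") not in bot_accounts
        for entry in log_entries
    )


def triage(log_entries):
    """Route an IP either to the block queue or to a review log."""
    if human_unblocked_before(log_entries):
        return "log_for_review"  # never re-block over a human admin's unblock
    return "eligible"
```

An IP whose log contains `{"action": "unblock", "user": "ExampleAdmin"}` would be written to the review log rather than blocked.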

Does this bot support IPv6? SQL ^{Query me!} 16:12, 1 December 2019 (UTC)[reply]

Also, as far as the major compute services (amazon, azure, google) - how are you identifying these? SQL ^{Query me!} 16:14, 1 December 2019 (UTC)[reply]

@SQL: It does support IPv6, and blocks the /64 of detected IPv6 proxies. There is no special handling for the major compute services. Most of them are already blocked. If a blocked IP (including via a range block) shows up on one of the proxy lists, I currently don't even bother testing it; there is no point, since it's already blocked. (For the ancillary task of blocking web hosting types of IP ranges, that would be through manual review of the whois information. Basically, if I CU a spambot and find that it's on some minor cloud services company's IP range, I run it through ISP rangefinder, review the netblock names, and block everything that looks webhost-y. The only automation for those is to save me from clicking the block button 100 times.) ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]
ST47, it is often not the case that the major providers are already blocked. Amazon and Azure get new ranges all the time, see: User:SQL/Non-blocked compute hosts. I've always used [1] to detect Azure, and [2] to detect Amazon. Google is a bit more complicated. SQL ^{Query me!} 18:39, 1 December 2019 (UTC)[reply]

@SQL: What would you suggest doing with this information? Refrain from blocking proxy IPs within that range, or block the entire range? ST47 (talk) 19:14, 1 December 2019 (UTC)[reply]

ST47, I normally completely block those ranges by hand. It can be a bit tedious / tiresome. They're webhosts, and very commonly used by spammers / UPE. SQL ^{Query me!} 21:02, 1 December 2019 (UTC)[reply]

@SQL: Right, I guess my point is that if that range isn't blocked yet, it's still good to directly block any proxies within that range. (Hopefully, that range won't be unblocked for more than a couple of days, and once the range does get blocked, the bots will stop scanning it for proxies, and the short initial block will eventually expire, leaving the long-term rangeblock.) Improving the automation around detecting (and hopefully eventually blocking and unblocking) hosting ranges is important, but isn't this bot request's objective. ST47 (talk) 21:26, 1 December 2019 (UTC)[reply]

Fair enough, I thought that this would fall under point 3, IP ranges used for web hosting, cloud services, etc, manually reviewed and curated. SQL ^{Query me!} 22:10, 1 December 2019 (UTC)[reply]

That's intended to cover cases when I checkuser a spambot, find some huge webhosting range, run it through ISP rangefinder and decide to block the whole darn thing. Or for another example, based on this experimental product, I found these guys. There are over 1,000 individually blocked proxy IPs in that AS. Some of the ranges in ISP Rangefinder I wasn't sure about, and left unblocked. But still, the 81 ranges that I did block, I think should be done with a bot account - it's assisted rather than automatic, but doing it with my normal account just floods my block log. ST47 (talk) 22:35, 1 December 2019 (UTC)[reply]

What authentication mechanism do you plan to use for this bot? Do you plan to secure the account with 2FA? — xaosflux ^Talk 18:19, 1 December 2019 (UTC)[reply]
- @Xaosflux: BotPasswords with only the necessary permissions granted, and obviously a strong unique password on the main account. I don't currently use the beta 2FA extension, preferring instead to use strong random passwords which are only used for a single account on a single website. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]

How are you going to avoid wheeling with User:ProcseeBot, which it appears uses different block lengths than your proposal? We really don't need that bot coming around and blocking for one period, then you coming around and blocking for a different period. — xaosflux ^Talk 18:41, 1 December 2019 (UTC)[reply]
- @Xaosflux: I don't really see what would be wheeling about that. If an IP is already blocked, neither this bot nor ProcseeBot would change the block duration. If an IP isn't currently blocked, then whichever bot or human admin gets around to it first would choose a block duration. ST47 (talk) 19:05, 1 December 2019 (UTC)[reply]
  - @ST47: ok so if the other bot thinks a one year block is fine, your bot isn't going to go and change it to a 2 year? — xaosflux ^Talk 20:08, 1 December 2019 (UTC)[reply]
    - @Xaosflux: That is absolutely correct. If an IP is blocked, either directly or via a range block, this bot does not modify the block settings at all. ST47 (talk) 20:14, 1 December 2019 (UTC)[reply]
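The "already blocked, directly or via a range block" test can be illustrated with Python's standard `ipaddress` module. This is a sketch only: in practice the bot would ask the wiki for active blocks, whereas here `blocked_targets` is a hypothetical in-memory list of blocked IPs and CIDR ranges.

```python
import ipaddress


def is_already_blocked(ip, blocked_targets):
    """True if ip is blocked directly or falls inside a blocked range.

    blocked_targets holds strings such as "192.0.2.7" or "198.51.100.0/24".
    """
    addr = ipaddress.ip_address(ip)
    for target in blocked_targets:
        # A bare IP parses as a single-address network (/32 or /128).
        net = ipaddress.ip_network(target, strict=False)
        if addr.version == net.version and addr in net:
            return True
    return False
```

If this returns true, the IP is skipped entirely, so an existing block's duration and settings are never changed.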

{{BAGAssistanceNeeded}} I know we like to leave these open for a while for comment, but there hasn't been any action in a while. It has already been advertised in several places including AN. Is a trial possible? ST47 (talk) 05:28, 3 January 2020 (UTC)[reply]
@ST47: please build out a userpage at User:ST47ProxyBot that clearly explains what this bot will do, and what anyone who is blocked by it should do if they have issues with their block. — xaosflux ^Talk 15:15, 3 January 2020 (UTC)[reply]
@Xaosflux: done. ST47 (talk) 20:22, 4 January 2020 (UTC)[reply]
WP:NOP is a clear justification for blocking open proxies, but what is the policy that allows for indiscriminate blocking of VPNs (especially when the block is based on hearsay from unnamed third-party sources) or indiscriminate blocking of webhosts? How often are these third-party VPN lists updated, and how quickly are IP addresses removed from those lists when they are no longer hosting open VPNs? Is your bot going to be regularly checking the open proxies, VPNs, and webhosts and unblocking them as soon as they are no longer open? --Ahecht (TALK PAGE) 15:59, 5 February 2020 (UTC)[reply]
If my understanding is correct, open VPNs can and do work like open proxies insofar as everybody can access and use them without being detectable. Jo-Jo Eumerus (talk) 16:09, 5 February 2020 (UTC)[reply]
The difference is that the bot actually tests open proxies, but does not test VPNs. IP addresses can and do change hands, and many of these lists are incorrect (just see Pppery's SPI case for an example of an editor being caught up in this). --Ahecht (TALK PAGE) 22:06, 26 February 2020 (UTC)[reply]
By the time an IP is ready to be blocked, I am confident that there is a VPN service, but I'm not able to test whether I can use it to access Wikipedia. (Proxy list + nmap service fingerprint seems to be the best way to do so; I was using a couple of APIs, but there are issues with rate limits.) The block durations are the main mechanism used to limit how long an IP address remains blocked after the proxy or VPN service is removed. In order to get a really long block, an IP needs to have already been an open proxy for a really long time. Automatic unblocking is not in scope for this request. This bot approval would not affect the SPI case you mentioned, which appears to have been related to a commercial VPN service which was showing up in WHOIS data. WHOIS data is not relevant to this request, and this request does not target commercial VPN services, which I believe we already have well in hand. ST47 (talk) 07:00, 28 February 2020 (UTC)[reply]

{{BAGAssistanceNeeded}} Folks, this is a pretty important task, and the operator has provided good answers to all questions/concerns above. Any particular reason why it hasn't been approved for trial? -FASTILY 03:33, 21 March 2020 (UTC)[reply]

Various reasons, I suspect, but likely because we were busy with what we feel are more important things and/or felt that the scope/details of the task were too far above our comfort level. Personally, I left it be because two other BAGs were (I thought) going to deal with it, but clearly they have decided to wait. Primefac (talk) 22:17, 22 March 2020 (UTC)[reply]

Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. I'm going to grant adminbot for a week, but I would like to see at most 10 blocks. After this has been completed, please post the full results here with a note at WP:AN so that input about the blocks/length/etc can be evaluated. Primefac (talk) 22:17, 22 March 2020 (UTC)[reply]

@Primefac: Can I suggest not granting "bot" rights, so that the blocks can be seen in recent changes? DannyS712 (talk) 22:25, 22 March 2020 (UTC)[reply]

Trial Results

IP Address Blocked	Type of Proxy	Basis for Block	Previous Block Count	Block Duration	WHOIS Link
108.60.113.218	Public OpenVPN Server	Advertised on an open proxy list, confirmed port open, confirmed VPN server listening, listed on proxy blacklists	2, both for the same reason (although on a different port)	1 year	WiLine Networks Inc. (commercial ISP)
119.193.22.213	Public OpenVPN Server	Advertised on an open proxy list, confirmed ports (TCP and UDP) open, confirmed VPN server listening, listed on proxy blacklists	0	2 weeks	Korea Telecom (residential ISP)
24.189.33.84	Public OpenVPN Server	Advertised on an open proxy list, confirmed ports (TCP and UDP) open, confirmed VPN server listening, listed on proxy blacklists	2, both for the same reason	1 year	Optimum Online (residential ISP?)
121.109.129.46	Public OpenVPN Server	Advertised on an open proxy list, confirmed ports (TCP and UDP) open, confirmed VPN server listening, listed on proxy blacklists	2, both for the same reason	1 year	KDDI Corporation (Japanese company, unknown connection type)
201.113.47.213	Public OpenVPN Server	Advertised on an open proxy list, confirmed port open, confirmed VPN server listening, listed on proxy blacklists	0	2 weeks	Uninet S.A. de C.V. (residential ISP?)
5.17.89.13	Open SOCKS Proxy Server	Advertised on an open proxy list, confirmed able to reach Wikipedia through the proxy, listed on proxy blacklists	1, same reason	2 months	Z-Telecom Network (unknown ISP)
159.192.253.187	Open SOCKS Proxy Server	Advertised on an open proxy list, confirmed able to reach Wikipedia through the proxy, listed on proxy blacklists	1, by ProcseeBot for the same reason	2 months	CAT Telecom (unknown ISP)
103.206.225.64	Open SOCKS Proxy Server	Advertised on an open proxy list, confirmed able to reach Wikipedia through the proxy, listed on proxy blacklists	0	2 weeks	Acme Diginet Corporation (unknown type of company)
190.104.204.242	Open SOCKS Proxy Server	Advertised on an open proxy list, confirmed able to reach Wikipedia through the proxy, listed on proxy blacklists	1, by ProcseeBot for the same reason	2 months	Nestle Argentina (business, likely compromised server?)
103.117.110.254	Open SOCKS Proxy Server	Advertised on an open proxy list, confirmed able to reach Wikipedia through the proxy, listed on proxy blacklists	2, same reason	1 year	I Link Internet Service (residential ISP?)

Trial complete. @Primefac: I have run the bot for 10 actions, and listed the results above. As it turned out, most (but not all) of the IPs had been blocked before, so you can see the escalating block duration: a first detection results in a block for 2 weeks, whereas the IP addresses that have been hosting a proxy for a longer period of time also get longer blocks. Two of the addresses are also blocked globally, likely due to Jon Kolbert's work dealing with spambots. Let me know if you have any questions, and I'll also post to WP:AN as you asked. ST47 (talk) 02:54, 23 March 2020 (UTC)[reply]

Tag added for the bot. I'll leave this open for about a week to see if there's any input. Primefac (talk) 00:24, 25 March 2020 (UTC)[reply]

Approved. Primefac (talk) 12:15, 3 April 2020 (UTC)[reply]

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.