I would suggest using the regex only to find the file extensions (.xml or .csv). This should be a lot faster. When a match is found, you can look backwards, examining each preceding character, and stop when you reach one that couldn't be in a URL. See below:
import java.util.HashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final HashSet<String> urls = new HashSet<String>();
final Pattern fileExtRegex = Pattern.compile("\\.(xml|csv)");
final Matcher fileExtMatcher = fileExtRegex.matcher(htmlString);
// Find next occurrence of ".xml" or ".csv" in htmlString
while (fileExtMatcher.find()) {
// Go backwards from the character just before the file extension
int dotPos = fileExtMatcher.start() - 1;
int charPos = dotPos;
while (charPos >= 0) {
// Break if current character is not a valid URL character
char chr = htmlString.charAt(charPos);
if (!((chr >= 'a' && chr <= 'z') ||
(chr >= 'A' && chr <= 'Z') ||
(chr >= '0' && chr <= '9') ||
chr == '-' || chr == '+' || chr == '&' || chr == '@' ||
chr == '#' || chr == '/' || chr == '%' || chr == '?' ||
chr == '=' || chr == '~' || chr == '|' || chr == '!' ||
chr == ':' || chr == ',' || chr == '.' || chr == ';' || chr == '_')) {
break;
}
charPos--;
}
// Extract/add URL if at least one valid URL character precedes the file extension
// (this also covers a match that starts at index 1 of htmlString)
if (charPos < dotPos) {
String url = htmlString.substring(charPos + 1, fileExtMatcher.end());
urls.add(makeAbsoluteURL(url));
}
}
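For anyone who wants to try the technique in isolation, here is a minimal self-contained sketch of the same backward scan. The class name ExtensionScanDemo, the helper isUrlChar and the sample HTML are made up for illustration. It leaves out makeAbsoluteURL (its definition isn't shown here) and includes '_' in the valid characters, as in the character class from the original regex.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class ExtensionScanDemo {
    private static final Pattern FILE_EXT = Pattern.compile("\\.(xml|csv)");

    // Assumed set of valid URL characters: [-a-zA-Z0-9+&@#/%?=~_|!:,.;]
    static boolean isUrlChar(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || "-+&@#/%?=~_|!:,.;".indexOf(c) >= 0;
    }

    static Set<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = FILE_EXT.matcher(html);
        while (m.find()) {
            // Walk backwards from the dot while the preceding character is valid
            int start = m.start();
            while (start > 0 && isUrlChar(html.charAt(start - 1))) {
                start--;
            }
            // Keep the match only if something precedes the extension
            if (start < m.start()) {
                urls.add(html.substring(start, m.end()));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/data/feed.xml\">feed</a> "
                + "<a href='/files/report_2020.csv'>report</a>";
        // prints [http://example.com/data/feed.xml, /files/report_2020.csv]
        System.out.println(extract(html));
    }
}
```

The scan stops at the quote characters around each href value because they are not in the valid set, which is what delimits the URL here.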
Small disclaimer: I used part of your original regex for valid URL characters: [-a-zA-Z0-9+&@#/%?=~_|!:,.;]. I haven't verified that this is comprehensive, and there are perhaps further improvements that could be made, e.g. it would currently also pick up file names from local file paths (e.g. myfile.xml from C:\TEMP\myfile.xml) as well as URLs. I wanted to keep the code above simple to demonstrate the technique, so I haven't tackled this.
EDIT: Following the comment about efficiency, I've modified the code so it no longer uses a regex to check for valid URL characters. Instead, it compares each character against the valid ranges manually. Uglier code, but it should be faster...
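If the long chain of comparisons bothers you, one possible alternative (a different technique from the range checks above, and I haven't benchmarked it) is a precomputed lookup table indexed by character code. The class name UrlCharTable is made up for illustration, and the character set is assumed to be the same [-a-zA-Z0-9+&@#/%?=~_|!:,.;] as before.

```java
// Hypothetical helper; a lookup-table variant of the valid-character check.
final class UrlCharTable {
    // One flag per ASCII code; anything outside ASCII is treated as invalid
    private static final boolean[] VALID = new boolean[128];

    static {
        for (char c = 'a'; c <= 'z'; c++) VALID[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) VALID[c] = true;
        for (char c = '0'; c <= '9'; c++) VALID[c] = true;
        for (char c : "-+&@#/%?=~_|!:,.;".toCharArray()) VALID[c] = true;
    }

    private UrlCharTable() { }

    static boolean isUrlChar(char c) {
        return c < VALID.length && VALID[c];
    }
}
```

The inner loop's condition would then shrink to a single call such as if (!UrlCharTable.isUrlChar(chr)) break; and the set of valid characters lives in one place.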