I want to find target file names in a list of strings using a regex, but with a query that doesn't exactly match the file names. The files:
- I have a few thousand files, and query speed is not a concern
- can have any case
- can use spaces, underscores, dashes, or dots to separate words
- if the file uses " - " to separate the document name from its source, just ignore the source (anything before the " - ")
- if the exact terms are contained in a file name that has other text in it, disregard that file (like FileList[0] in the example below)
As I will be doing many of these lookups in my Java application, I want to write a Java function that can handle queries and file names of varying length.
Example:
Query = "microfluidic systems"
FileList[] = {
"The.Fabrication.of.microfluidic.systems.in.PDMS.pdf",
"IEEE - microfluidic systems.pdf",
"microfluidic systems.pdf",
"Self-Assembled Electrical Contact to Nanoparticles.pdf",
"IEEE - Gallium Alloy as Lubricant_for_High_Current - Density Brushes.pdf",
"Liquid Metal Marbles.pdf"
}
Here the second and third files should match the query.
Is this too difficult to do with regex? Or will I have to write a separate regex case for each number of words used in my queries?
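A single pattern can be assembled from any number of query words, so separate cases per word count should not be needed. The following is a minimal sketch under the rules listed above, not a definitive implementation. The class name `FileNameMatcher` and the methods `buildPattern` and `findMatches` are hypothetical names chosen for illustration, and the sketch assumes the source prefix is always separated by a dash surrounded by whitespace and that every file has a single extension.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FileNameMatcher {

    // Hypothetical helper: builds one pattern for a query with any number of words.
    static Pattern buildPattern(String query) {
        String[] words = query.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        // Optional source prefix: anything up to a " - " separator is ignored (assumption).
        regex.append("(?:.*\\s-\\s+)?");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                // Words may be separated by spaces, underscores, dots or dashes.
                regex.append("[\\s_.-]+");
            }
            // Quote each word so regex metacharacters in the query are taken literally.
            regex.append(Pattern.quote(words[i]));
        }
        // File extension: a dot followed by characters that are not dots (assumption).
        regex.append("\\.[^.]+");
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns the file names whose whole name (minus an optional source) matches the query.
    static List<String> findMatches(String query, List<String> fileNames) {
        Pattern pattern = buildPattern(query);
        List<String> hits = new ArrayList<>();
        for (String name : fileNames) {
            // matches() requires the entire file name to fit the pattern.
            if (pattern.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        files.add("The.Fabrication.of.microfluidic.systems.in.PDMS.pdf");
        files.add("IEEE - microfluidic systems.pdf");
        files.add("microfluidic systems.pdf");
        files.add("Liquid Metal Marbles.pdf");
        for (String hit : findMatches("microfluidic systems", files)) {
            System.out.println(hit);
        }
    }
}
```

Because the whole name has to match, a file such as FileList[0], which has extra words around the query terms, is rejected, while a name with an "IEEE - " style prefix is still accepted.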
EDIT: Based on QPaysTaxes's answer:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String yourText = "microfluidic systems";
String fileName = "sometext microfluidic systems.pdf";
// Let each space in the query match any run of whitespace, underscores, dots or dashes
String search = yourText.replace(" ", "[\\s_.-]+").toLowerCase();
Pattern pattern = Pattern.compile("\\s*" + search + "\\..+$");
Matcher matcher = pattern.matcher(fileName.toLowerCase());
if (matcher.find())
{
    System.out.println(matcher.group());
}
Prints the result:
microfluidic systems.pdf
This file name has extra text before the query terms, so I expected no match. I think I might be missing something?
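One likely cause is that `Matcher.find()` succeeds when any part of the input matches, and the leading `\\s*` does not anchor the pattern to the start of the name, so the "sometext" prefix is simply skipped. The sketch below compares `find()` with `matches()`, which requires the whole input to match, using the same pattern as above. The class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

class FindVersusMatches {

    // Same pattern construction as in the snippet above.
    static Pattern build(String query) {
        String search = query.toLowerCase().replace(" ", "[\\s_.-]+");
        return Pattern.compile("\\s*" + search + "\\..+$");
    }

    // True if the pattern matches somewhere inside the file name.
    static boolean foundAnywhere(String query, String fileName) {
        return build(query).matcher(fileName.toLowerCase()).find();
    }

    // True only if the pattern matches the entire file name.
    static boolean wholeNameMatches(String query, String fileName) {
        return build(query).matcher(fileName.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        String query = "microfluidic systems";
        String fileName = "sometext microfluidic systems.pdf";
        System.out.println(foundAnywhere(query, fileName));     // true
        System.out.println(wholeNameMatches(query, fileName));  // false
        System.out.println(wholeNameMatches(query, "microfluidic systems.pdf")); // true
    }
}
```

Note that with `matches()` this particular pattern also rejects "IEEE - microfluidic systems.pdf", since it has no provision for the optional source prefix; that would need to be added to the pattern separately.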