Saturday, 28 March 2015

Matching filenames with regex for queries with a varying number of words



I want to pick a target filename out of a list of strings using regex, with a query that doesn't exactly match the filenames. About the files:



  • There are a few thousand files, so I'm not concerned about query speed.

  • Filenames can be in any case.

  • Words may be separated by spaces, underscores, dashes or dots.

  • If a filename uses "-" to separate the document name from its source, just ignore the source (anything before the "-").

  • If the exact terms appear in a filename that also contains other text, disregard that file (like FileList[0] in the example below).
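One way to read the rules above is as a normalization step followed by a plain string comparison. The following is only a minimal sketch under stated assumptions: the class and method names (TitleNormalizer, normalize, matches) are hypothetical, every file is assumed to have an extension, and the source separator is assumed to be a dash with whitespace on both sides (so the dash in "Self-Assembled" counts as a word separator, not a source separator).

```java
import java.util.Locale;

/** Hypothetical helper: reduces a filename to a comparable title string. */
final class TitleNormalizer {

    /** Drops the extension and source prefix, unifies separators, lower-cases. */
    static String normalize(String fileName) {
        String name = fileName;

        // Assumption: everything after the last dot is the extension.
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot);
        }

        // Assumption: the source is set off by a dash with whitespace on both
        // sides; only the first such dash is treated as the source delimiter.
        String[] parts = name.split("\\s+-\\s+", 2);
        if (parts.length == 2) {
            name = parts[1];
        }

        // Collapse runs of spaces, underscores, dashes and dots into one space.
        return name.replaceAll("[\\s_.-]+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** True when the normalized filename equals the normalized query exactly. */
    static boolean matches(String query, String fileName) {
        String q = query.replaceAll("[\\s_.-]+", " ").trim().toLowerCase(Locale.ROOT);
        return normalize(fileName).equals(q);
    }
}
```

Because the comparison is an exact equality on the normalized strings, a filename with extra words around the query terms is rejected, and the number of words in the query does not matter.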


As I will be doing many of these lookups in my Java application, I want to write a Java function that can handle queries and filenames with a varying number of words.


Example:



Query = "microfluidic systems"

FileList[] = {
"The.Fabrication.of.microfluidic.systems.in.PDMS.pdf",
"IEEE - microfluidic systems.pdf",
"microfluidic systems.pdf",
"Self-Assembled Electrical Contact to Nanoparticles.pdf",
"IEEE - Gallium Alloy as Lubricant_for_High_Current - Density Brushes.pdf",
"Liquid Metal Marbles.pdf"
}


Here the second and third files should match the query.


Is this too difficult to do with regex? Or will I have to write a separate regex case for each number of words used in my queries?


EDIT: Based on QPaysTaxes's answer:



String yourText = "microfluidic systems";
String fileName = "sometext microfluidic systems.pdf";

String search = yourText.replace(" ", "[\\s_.-]+").toLowerCase();
Pattern pattern = Pattern.compile("\\s*" + search + "\\..+$");
Matcher matcher = pattern.matcher(fileName.toLowerCase());
if (matcher.find()) {
    System.out.println(matcher.group());
}


Prints the result:



microfluidic systems.pdf


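I think I might be missing something? This filename has extra text ("sometext") in front of the query terms, so I expected it not to match.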
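A likely reason for the match is that find() looks for the pattern anywhere in the input, and the pattern has no anchor at the start, so the leading "sometext" is simply skipped. Below is a hedged sketch of an anchored alternative, not the answer's own code: it builds one pattern for any number of query words and uses matches(), which requires the whole filename to fit. The names FileQueryMatcher, buildPattern and findMatches are hypothetical, and the optional source prefix is assumed to end with a dash surrounded by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: whole-name regex matching for multi-word queries. */
final class FileQueryMatcher {

    /** Builds a single pattern for a query with any number of words. */
    static Pattern buildPattern(String query) {
        StringBuilder regex = new StringBuilder();
        // Assumption: an optional source prefix is any text that ends with
        // a dash surrounded by whitespace, e.g. "IEEE - ".
        regex.append("(?:.*?\\s+-\\s+)?");
        String[] words = query.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("[\\s_.-]+"); // any run of separators between words
            }
            regex.append(Pattern.quote(words[i])); // treat each word literally
        }
        regex.append("\\.[^.]+"); // the file extension
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the filenames that match the query as a whole. */
    static List<String> findMatches(String query, List<String> fileNames) {
        Pattern pattern = buildPattern(query);
        List<String> result = new ArrayList<>();
        for (String fileName : fileNames) {
            // matches() must consume the entire filename, unlike find().
            if (pattern.matcher(fileName).matches()) {
                result.add(fileName);
            }
        }
        return result;
    }
}
```

Since the separators are inserted in a loop over the query words, no per-word-count case statements are needed. Pattern.quote keeps characters such as "." or "+" in a query from being interpreted as regex syntax.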



