searching within multiple files

croxsons

Original Poster:

1,843 posts

201 months

Friday 8th February 2008
I have loads of files from an ex-employee, and I need to rifle through them to grab email addresses so that we can notify customers of her departure.

Is there an easy way of doing it? The files are a mix of Word, Excel and PDF documents.

I have used Google Desktop Search in the past, but has anyone had experience limiting a search to a portable drive, as in this case, and to a single character, "@"?

TIA

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
Google is wonderful for this you know.

Search for "search within multiple files for email addresses"

Whole raft of tools available.

Simon

cj_eds

1,567 posts

223 months

Friday 8th February 2008
On a DOS prompt, change to the directory you want to search

Use "findstr" (type help findstr for the full options)
Something like :
findstr /S /M "@" .\*

should work though (/S searches subdirectories, /M prints only the names of matching files), although expect an awful lot of extra crap in the search result due to binary files etc. You can skip those with /P, but you might find it also hides genuine results. Someone who's better with regular expressions can probably expand "@" into something that matches email addresses of the form something@domain.com
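
For anyone who would rather script it, here is a minimal Python sketch of roughly what the findstr command above does: walk a directory tree and list every file whose raw bytes contain "@". The function name files_containing is made up for illustration, and like findstr it will flag plenty of binary files that merely happen to contain that byte.

```python
import os


def files_containing(root, needle=b"@"):
    """Return sorted paths under root whose raw bytes contain needle.

    Roughly mirrors `findstr /S /M "@" .\\*`: recurse into subdirectories
    and report only the names of matching files.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read():
                        matches.append(path)
            except OSError:
                # Unreadable file (locked, permissions): skip it.
                continue
    return sorted(matches)
```

Calling files_containing("E:\\") on the portable drive (the drive letter is only an example) would give the list of candidate files to open by hand.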

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
cj_eds said:
On a DOS prompt, change to the directory you want to search

Use "findstr" (type help findstr for the full options)
Something like :
findstr /S /M "@" .\*

should work though (/S searches subdirectories, /M prints only the names of matching files), although expect an awful lot of extra crap in the search result due to binary files etc. You can skip those with /P, but you might find it also hides genuine results. Someone who's better with regular expressions can probably expand "@" into something that matches email addresses of the form something@domain.com
1. I doubt the OP is running anything that gives him a DOS window.

2. FINDSTR will only find patterns of text in files. Ever tried opening a Word, Excel or PDF file in a DOS text editor? Unreadable.


cj_eds

1,567 posts

223 months

Friday 8th February 2008
I was thinking you can use that from the Windows command prompt (Start->Run->enter "cmd" or "command" depending on Windows version), then it'll give you a list of filenames you can check manually in DOS/Excel etc. There's bound to be better solutions out there but it's a starting point.

ETA: OP, if you're not familiar with the command prompt etc then it's easier just to go googling for something to do it for you.

Edited by cj_eds on Friday 8th February 12:13

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
cj_eds said:
I was thinking you can use that from the Windows command prompt (Start->Run->enter "cmd" or "command" depending on Windows version), then it'll give you a list of filenames you can check manually in DOS/Excel etc. There's bound to be better solutions out there but it's a starting point.

ETA: OP, if you're not familiar with the command prompt etc then it's easier just to go googling for something to do it for you.

Edited by cj_eds on Friday 8th February 12:13
Windows command prompt is not a DOS window. DOS commands as we remember them, like FINDSTR, have not been available since Windows ME (IIRC).

Certainly you cannot do this in XP and above.

In any case, FINDSTR would not find the text pattern in an Excel, Word or PDF file. They are not ASCII files.




dilbert

7,741 posts

233 months

Friday 8th February 2008
I think the regex for this is;
\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Edited by dilbert on Friday 8th February 12:54

croxsons

Original Poster:

1,843 posts

201 months

Friday 8th February 2008
sorry, was greasing a torque meter ...

Firstly, I am relatively comfortable using the command prompt in Win XP, which is what I have, but as previously stated it won't work, as it can't "open" the files as such; it just looks at the raw encoding of the file (similar to opening a JPEG file in Word).

In answer to the earlier post, yes I could look on Google, but lots of hits means lots of confusion. I reckoned that there would be someone who had had a similar problem and had found a great solution.

cj_eds

1,567 posts

223 months

Friday 8th February 2008
dilbert said:
I think the regex for this is;
\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Edited by dilbert on Friday 8th February 12:53
Did you google that or just knock it up off the top of your head?

I'll bow to the superior knowledge of DOS/command prompt commands. I tend to use Cygwin & grep or a text editor for this sort of thing - the latter of which isn't complicated enough to distinguish file types!

dilbert

7,741 posts

233 months

Friday 8th February 2008
cj_eds said:
dilbert said:
I think the regex for this is;
\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Did you google that or just knock it up off the top of your head?

I'll bow to the superior knowledge of DOS/command prompt commands. I tend to use Cygwin & grep or a text editor for this sort of thing - the latter of which isn't complicated enough to distinguish file types!
Err nope, I wrote a regular expression parser a while back, and this was one of the PCREs I used to test it.

Disclaimer - As with all regexes, implementation is subject to regional variation, and accurate performance is not guaranteed!

croxsons

Original Poster:

1,843 posts

201 months

Friday 8th February 2008
dilbert said:
I think the regex for this is;
\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Edited by dilbert on Friday 8th February 12:54
excuse the ignorance, but what is that? How is it used? My knowledge certainly doesn't cover that ...

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
croxsons said:
excuse the ignorance, but what is that? How is it used? My knowledge certainly doesn't cover that ...
What operating system is running on the machine you want to use to do the searching?

Is it Windows XP or Vista?

croxsons

Original Poster:

1,843 posts

201 months

Friday 8th February 2008
sgrimshaw said:
croxsons said:
excuse the ignorance, but what is that? How is it used? My knowledge certainly doesn't cover that ...
What operating system is running on the machine you want to use to do the searching?

Is it Windows XP or Vista?
XP Pro

alock

4,233 posts

213 months

Friday 8th February 2008
sgrimshaw said:
Windows command prompt is not a DOS window. DOS commands as we remember them, like FINDSTR, have not been available since Windows ME (IIRC).

FINDSTR is available on XP and Vista.

dilbert

7,741 posts

233 months

Friday 8th February 2008
croxsons said:
dilbert said:
I think the regex for this is;
\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Edited by dilbert on Friday 8th February 12:54
excuse the ignorance, but what is that? How is it used? My knowledge certainly doesn't cover that ...
PCRE = Perl Compatible Regular Expressions.

Perl is a programming language that adopted the wider idea of "regular expressions" as a way to describe, using a text string, the formatted text you want to find. There are various flavours of regex in existence, but PCRE seems to me the most consistent and intelligible.

"grep" is a tool for doing exactly what you are looking to do, but it doesn't just find e-mails, it'll find any text. You need the regular expression to tell it what to look for. The one I posted finds e-mail addresses.
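
To show how the expression is actually used, here is a minimal sketch in Python, whose re module accepts this pattern unchanged. Two things are assumptions on my part rather than part of the original post: the pattern is written in upper case, so it needs a case-insensitive flag to match ordinary addresses, and the helper name find_emails is made up for illustration. Note that the {2,4} at the end means top-level domains longer than four letters will not match.

```python
import re

# The pattern from this thread, compiled case-insensitively because it
# only lists upper-case letters.
EMAIL = re.compile(r"\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)


def find_emails(text):
    """Return the distinct addresses found in text, lower-cased, in order of appearance."""
    seen = []
    for match in EMAIL.finditer(text):
        address = match.group().lower()
        if address not in seen:
            seen.append(address)
    return seen
```

Feeding it the text of a file gives back a de-duplicated list of addresses; anything without a dotted domain after the "@" is ignored.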

Software that does what you are looking for can be found here.
http://www.regular-expressions.info/powergrep.html

That site is nothing to do with me, but is (IMO) one of the best resources for working out how to use regex. More importantly it is instrumental in the ongoing effort to standardise the regex syntax.

Edited by dilbert on Friday 8th February 13:37

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
croxsons said:
XP Pro
In that case just ignore all this dos window, findstr and \b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b stuff, you can't use it anyway ;-)

I just downloaded and tried Email Extractor Files V2.2 from this site http://www.technocomsolutions.com/products.html#tr...
and it does EXACTLY what you want.

The trial is free; it will cost you $24.95 to register it and be able to save the data to a file.

Simon

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
alock said:
FINDSTR is available on XP and Vista.
You are absolutely correct, it is indeed there in C:\WINDOWS\SYSTEM32.

But, it won't help the OP.

Nor will GREP or POWERGREP.

He needs something that will search Word, Excel and PDF files.

Simon

ETA - I recognise that these will find data in these files. However, managing the whole process is not straightforward, and since there are tools available to do exactly what is required it seems logical to take the easy route.


Edited by sgrimshaw on Friday 8th February 13:55


Edited by sgrimshaw on Friday 8th February 14:15

dilbert

7,741 posts

233 months

Friday 8th February 2008
sgrimshaw said:
He needs something that will search Word, Excel and PDF files.
Last time I looked, all of those formats contained plain text representations of their content, presumably in order that "grep" type functions can work.

I would accept that PDF files don't have to, but they usually do!!!
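
That can be tested by scanning the raw bytes directly. A minimal Python sketch, assuming the text sits in the file either as plain single-byte ASCII or as UTF-16LE (older .doc and .xls files commonly use one or the other); the function name emails_in_bytes is made up for illustration. Content that is compressed inside the file, which is often the case in PDFs, may not be found this way.

```python
import re

# ASCII form of the address pattern, applied to raw bytes.
ASCII_EMAIL = re.compile(rb"[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

# The same pattern for UTF-16LE text, where every ASCII character is
# followed by a NUL byte.
UTF16_EMAIL = re.compile(
    rb"(?:[A-Za-z0-9._%-]\x00)+@\x00(?:[A-Za-z0-9.-]\x00)+\.\x00(?:[A-Za-z]\x00){2,4}"
)


def emails_in_bytes(data):
    """Return the set of lower-cased addresses found in a binary blob."""
    found = {m.group().decode("ascii").lower() for m in ASCII_EMAIL.finditer(data)}
    found |= {m.group().decode("utf-16-le").lower() for m in UTF16_EMAIL.finditer(data)}
    return found
```

Reading a file with open(path, "rb").read() and passing the result in is enough to see whether a given document gives up its addresses without a converter.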

sgrimshaw

7,336 posts

252 months

Friday 8th February 2008
dilbert said:
sgrimshaw said:
He needs something that will search Word, Excel and PDF files.
Last time I looked, all of those formats contained plain text representations of their content, presumably in order that "grep" type functions can work.

I would accept that PDF files don't have to, but they usually do!!!
Dilbert,

I did a quick test and from "text" files the output would be usable, but with Word and Excel files the output needs so much work it's just not worth it.

BTW - I just looked more closely at PowerGREP, and that does have more of a chance, but frankly the dedicated software is so much easier to use that I'd ask "why bother?"

Simon

zaktoo

805 posts

209 months

Friday 8th February 2008
A combination of find, file, grep, pdftotext and wv (to dump text out of Word docs) would certainly do the trick.

Putting them together is left as an exercise for the reader (ha!)
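
For the reader who does want the exercise done, one possible shape for it in Python rather than shell. This is only a sketch under several assumptions: pdftotext and wvText are installed and on the PATH, files are picked by extension instead of by running file, Excel is not handled at all (it would need its own converter), and the names harvest_emails, pdf_to_text, doc_to_text and plain_text are made up for illustration.

```python
import os
import re
import subprocess
import tempfile

EMAIL = re.compile(r"\b[A-Z0-9._%-]+?@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)


def pdf_to_text(path):
    # Assumes pdftotext is installed; "-" sends the extracted text to stdout.
    result = subprocess.run(["pdftotext", path, "-"], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def doc_to_text(path):
    # Assumes wvText (from the wv package) is installed; it writes to an output file.
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out.txt")
        subprocess.run(["wvText", path, out], capture_output=True, check=True)
        with open(out, "rb") as handle:
            return handle.read().decode("utf-8", errors="replace")


def plain_text(path):
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")


def harvest_emails(root, converters=None):
    """Walk root, convert each known file type to text, and collect addresses."""
    if converters is None:
        converters = {".pdf": pdf_to_text, ".doc": doc_to_text,
                      ".txt": plain_text, ".csv": plain_text}
    addresses = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            convert = converters.get(os.path.splitext(name)[1].lower())
            if convert is None:
                continue  # Unknown file type: skip it.
            try:
                text = convert(os.path.join(dirpath, name))
            except (OSError, subprocess.CalledProcessError):
                continue  # Converter missing or failed on this file.
            addresses.update(m.group().lower() for m in EMAIL.finditer(text))
    return sorted(addresses)
```

The converters argument is there so that another tool can be swapped in per extension without touching the walking and matching code.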