# VBForums CodeBank > CodeBank - Visual Basic 6 and earlier >  Find a string in a text file Fast....

## Jmacp

This demonstrates a faster way than the usual string-buffer method to search any file and pull out the full line that contains a given search string. It uses the API to open a file and read it into a buffer, then runs InStr through the buffer to find a match. It also compares this against the ordinary string-buffer method; the API version wins by a factor of about 3.5 for a 1 MB text file, which is not bad.

Could also open the file as binary and use InStr; I didn't bother with that, though.
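
For reference, the binary-open variant mentioned above might look something like the following. This is only a sketch, not code from the attached project: the names `ReadWholeFile` and `FindLineContaining` are made up for illustration, and it assumes an ANSI text file that fits in memory, with lines ending in vbNewLine and a search string that does not itself contain a line break.

```vb
Option Explicit

' Hypothetical helper: read a whole ANSI text file into a VB string.
Public Function ReadWholeFile(ByVal FileName As String) As String
    Dim FileNo As Integer
    Dim Buffer As String

    FileNo = FreeFile
    Open FileName For Binary Access Read As #FileNo
    If LOF(FileNo) > 0 Then
        ' Size the string first; Get then fills it with that many bytes,
        ' converting ANSI to Unicode on the way in.
        Buffer = Space$(LOF(FileNo))
        Get #FileNo, , Buffer
    End If
    Close #FileNo
    ReadWholeFile = Buffer
End Function

' Hypothetical helper: return the full line that contains Find, or "" if none.
Public Function FindLineContaining(ByRef Text As String, ByRef Find As String) As String
    Dim Hit As Long, LineStart As Long, LineEnd As Long

    If Len(Find) = 0 Then Exit Function
    Hit = InStr(1, Text, Find, vbBinaryCompare)
    If Hit = 0 Then Exit Function

    ' Walk back to the line break before the hit (or the start of the text).
    LineStart = InStrRev(Text, vbNewLine, Hit)
    If LineStart = 0 Then LineStart = 1 Else LineStart = LineStart + Len(vbNewLine)

    ' Walk forward to the line break after the hit (or the end of the text).
    LineEnd = InStr(Hit, Text, vbNewLine)
    If LineEnd = 0 Then LineEnd = Len(Text) + 1

    FindLineContaining = Mid$(Text, LineStart, LineEnd - LineStart)
End Function
```

Usage would be along the lines of `Debug.Print FindLineContaining(ReadWholeFile("C:\Test.txt"), "abc")`.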

----------


## Hack

Fine job, and I needed this today. It came in handy!

----------


## Merri

I wrote a competing API code that does things slightly differently. Instead of using StrConv for all the data, it just reads it into a string variable directly. I also removed the complex InStr + InStrRev code and just put in InStr vs. InStrB with a simple string search for "abc".

With an 11.8 MB file, memory usage with my function was around 12 - 13 MB. With Jmacp's original method, memory usage jumped to around 36 - 37 MB.

On the speed side differences are greater: my code is some 5 - 6 times faster. The difference becomes greater as the file size grows.


Just a reminder of the byte versions of the string functions: InStrB, LeftB$, MidB$, RightB$, LenB, ChrB$, AscB. Using them is "true" binary file handling with string functions, as no textual conversion takes place. Also, if you're reading UTF-8 data, it is much more straightforward to pass this kind of string to a Windows string conversion function and get back a string that is ready to use in VB6.
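
As a rough illustration of searching raw bytes with InStrB, here is a sketch that uses only built-in file I/O. It is not the code from my attachment: `FindBytesInFile` is a made-up name, and assigning the Byte array to a String makes one extra copy of the data, so it trades some memory for simplicity. It assumes an ANSI file that fits in memory.

```vb
Option Explicit

' Hypothetical helper: return the 1-based byte position of Find in the file,
' or 0 if it is not found.
Public Function FindBytesInFile(ByVal FileName As String, ByVal Find As String) As Long
    Dim FileNo As Integer
    Dim Data() As Byte
    Dim Raw As String
    Dim Needle As String

    ' Convert the search term to one byte per character so it matches file bytes.
    Needle = StrConv(Find, vbFromUnicode)
    If LenB(Needle) = 0 Then Exit Function

    FileNo = FreeFile
    Open FileName For Binary Access Read As #FileNo
    If LOF(FileNo) > 0 Then
        ReDim Data(0 To LOF(FileNo) - 1)
        Get #FileNo, , Data
        ' Byte array to String assignment copies the bytes as they are,
        ' with no ANSI to Unicode conversion.
        Raw = Data
        FindBytesInFile = InStrB(1, Raw, Needle, vbBinaryCompare)
    End If
    Close #FileNo
End Function
```

The position returned is a byte offset into the file, so any follow-up slicing should use MidB$ and LenB rather than Mid$ and Len.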

*Edit!*
If you also need an ANSI Split, kinda like SplitB, see my QuickSplitB sub.

----------


## Jmacp

I should have said that my code obviously wasn't polished up. I was just throwing in some ideas; the whole InStr / InStrRev part was just to get to the end point quickly, and the API ReadFile / CreateFile part was the real substance. I am sure your version is better. Well done!

----------


## stakemaster

Hello Merri,
I integrated your modified function into my application and it's by far the fastest file search I've used and makes the application perfectly usable now. It took just 4 minutes to scoot through a total of 6GB of CSV data. My question to you is: is it possible to use your function in conjunction with a regular expression pattern instead of a fixed search string, and still maintain its performance?

----------


## Merri

That would depend on the regular expression mechanism: if you can pass a pointer to a buffer in memory and if you can do continuous calls without causing regular expression pattern to be analyzed each time separately, then you could achieve pretty good speeds.

In comparison if you'd need to pass "normal" strings and needed a conversion, that would cause a massive amount of extra work, a bit like how Jmacp's original code is when compared to what I did.


It is all about keeping data unmodified as much as possible.

----------


## stakemaster

If anyone has the time to investigate this for me, the attached test project will save some time.

Thanks.

----------


## Liquid Metal

Merri, your method works great, but for some reason it craps out on files that are 500,000 KB. Do you have any idea why?

----------


## Merri

If you are reading files that big you may wish to consider using some level of buffering instead of reading the whole file to memory at once.
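
A buffered search could be sketched roughly as below. This is an illustration under assumptions rather than a drop-in replacement for the attached project: `FindInLargeFile` and the 1 MB `CHUNK_SIZE` are made up for the example, it uses the built-in Open/Get instead of the API, and it is limited to files under 2 GB because LOF returns a Long. Each chunk overlaps the previous one by the length of the search term minus one byte, so a match that straddles a chunk boundary is not missed.

```vb
Option Explicit

' Hypothetical helper: return the 1-based byte position of Find in the file,
' reading it one chunk at a time. Returns 0 if not found.
Public Function FindInLargeFile(ByVal FileName As String, ByVal Find As String) As Long
    Const CHUNK_SIZE As Long = 1048576   ' 1 MB per read; an arbitrary choice
    Dim FileNo As Integer
    Dim Data() As Byte
    Dim Chunk As String, Needle As String
    Dim Pos As Long, Remaining As Long, ReadLen As Long, Hit As Long

    Needle = StrConv(Find, vbFromUnicode)
    If LenB(Needle) = 0 Or LenB(Needle) > CHUNK_SIZE Then Exit Function

    FileNo = FreeFile
    Open FileName For Binary Access Read As #FileNo
    Pos = 1
    Do While Pos <= LOF(FileNo)
        ' Shrink the read for the final chunk so no stale bytes are searched.
        Remaining = LOF(FileNo) - Pos + 1
        If Remaining < CHUNK_SIZE Then ReadLen = Remaining Else ReadLen = CHUNK_SIZE

        ReDim Data(0 To ReadLen - 1)
        Get #FileNo, Pos, Data
        Chunk = Data                      ' raw byte copy, no conversion

        Hit = InStrB(1, Chunk, Needle, vbBinaryCompare)
        If Hit > 0 Then
            FindInLargeFile = Pos + Hit - 1
            Exit Do
        End If

        If ReadLen < CHUNK_SIZE Then Exit Do   ' that was the last chunk
        ' Step forward, keeping an overlap for matches across the boundary.
        Pos = Pos + ReadLen - (LenB(Needle) - 1)
    Loop
    Close #FileNo
End Function
```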

----------


## Liquid Metal

> If you are reading files that big you may wish to consider using some level of buffering instead of reading the whole file to memory at once.


Agreed, but can your code still be applied?  It seems that your API code reads the whole file at once.

Is it similar to this one:
http://www.vbforums.com/showpost.php...0&postcount=12

If yes, can you help me clean it up?  *I would really appreciate your help on this.*

----------


## Liquid Metal

Hi Merri, I got your code to work on big files.  It does run faster than any of the other methods, but there is a memory leak.

If you run it in Excel VBA, you can see the memory keep accumulating in Task Manager.

----------


## Merri

The leak is probably introduced in your way of using the code. For example, the PutMem4 part of the code places the created string into a string variable. If you do this in a loop and never use vbNullString to the string variable you never free the strings from memory and thus you keep on hogging more memory.

----------


## Liquid Metal

> The leak is probably introduced in your way of using the code. For example, the PutMem4 part of the code places the created string into a string variable. If you do this in a loop and never use vbNullString to the string variable you never free the strings from memory and thus you keep on hogging more memory.


I believe that is accurate.  Is there an api to free up the memory?  When I set the string to = "", then it clears it out but has to be within the loop.

----------


## Merri

You can't use "" because that allocates an empty string. You must use vbNullString. In this case the use of vbNullString is faster than using an API call.

Note that you can also just create the buffer once and keep filling it again and again, you don't need to create the buffer over and over again. Clearing up the buffer with vbNullString would be good practice (that was left out from that example... and it should have more comments).
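
The difference between "" and vbNullString can be checked with StrPtr. The small sketch below only illustrates that one point; `ShowEmptyVersusNull` is a made-up name. My reading of why this matters here is that if PutMem4 overwrites the pointer of a variable that still holds an allocated string (even an empty one), that old allocation is never freed, whereas a null string has nothing to free.

```vb
Option Explicit

Public Sub ShowEmptyVersusNull()
    Dim s As String

    s = ""
    ' "" is a real, allocated zero-length string, so the pointer is non-zero.
    Debug.Print "Empty string allocated: "; (StrPtr(s) <> 0)

    s = vbNullString
    ' vbNullString releases the string and leaves a null pointer.
    Debug.Print "Null string pointer is zero: "; (StrPtr(s) = 0)
End Sub
```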

----------


## Liquid Metal

I did what you recommended and updated the "" to vbNullString.  That was a good recommendation.

I also did what you said and now create the buffer only once, by moving the string allocation API call out of the loop.  Somehow, that turned my data into gibberish.  Was I supposed to move the PutMem4 out of the loop too?

----------


## Merri

Yes, the string allocation and PutMem4 should always go together and be as close to each other as possible. So if you move one you must move the other. Also, regarding the end of file when buffer will be larger than the remaining file, you must decrease the buffer size. Easiest way and probably the fastest is to use LeftB$.

----------


## Vandmanden

Hi Merri, thx for your post. It's very fast.

I have a problem: if I search a txt file for IB, and the txt file contains GOTLIB, then it finds it. Is there some way to get a positive hit only if the file contains IB as a word of its own?

----------


## Merri

To keep it relatively fast you need to code additional conditions once you have found a match. Basically you check the character before and after the match to see whether it is or isn't something you want to be there. If the characters are what you don't want then you search again.

Alternatively, if there is always a specific character before and after the string to be found, such as line change, then you can simply include them in your search.
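
The check described above could be sketched like this. It is a simplified illustration on an ordinary VB string using InStr (the same idea applies to InStrB on raw bytes); `FindWholeWord` and `IsWordChar` are made-up names, and the assumption that only A-Z, a-z and 0-9 count as word characters is mine.

```vb
Option Explicit

' Hypothetical helper: position of Word in Text where it is not glued to
' other letters or digits, or 0 if there is no such occurrence.
Public Function FindWholeWord(ByRef Text As String, ByRef Word As String) As Long
    Dim Pos As Long

    If Len(Word) = 0 Then Exit Function
    Pos = InStr(1, Text, Word, vbBinaryCompare)
    Do While Pos > 0
        ' Accept the hit only if the characters on both sides are not word characters.
        If Not IsWordChar(Text, Pos - 1) And Not IsWordChar(Text, Pos + Len(Word)) Then
            FindWholeWord = Pos
            Exit Function
        End If
        ' Otherwise search again from just after this hit.
        Pos = InStr(Pos + 1, Text, Word, vbBinaryCompare)
    Loop
End Function

' True if the character at Index is a letter or digit; False if Index is outside the text.
Private Function IsWordChar(ByRef Text As String, ByVal Index As Long) As Boolean
    If Index < 1 Or Index > Len(Text) Then Exit Function
    Select Case AscW(Mid$(Text, Index, 1))
        Case 48 To 57, 65 To 90, 97 To 122   ' 0-9, A-Z, a-z
            IsWordChar = True
    End Select
End Function
```

With this, searching "GOTLIB and IB" for "IB" skips the hit inside GOTLIB and returns the position of the standalone IB.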

----------


## Vandmanden

Thx merri for your reply.

There is no specific way to identify the words, so I have an mdb file with over 10,000 words that the program has to go through. It takes the program 30 minutes to finish, so I am in search of some code that can reduce that time.

----------


## Merri

Open up a thread in the classic VB forum and post some of the code you use. People can probably point out the issues in your existing code; in the best case it is just a few small things that need to be changed to bring the speed to bearable levels. Also, try to say what information you want to get out, i.e. do you just need to know that the word is in the file, or is there something more?

----------


## Thirks

Merri, thanks for code. I don't really understand it but I have cribbed your project and put one line: 


> Label2.Caption = InStrB(API_Merri, Find)


, into a loop, with the Find string being read from a file. I have output the time it takes to process each batch of 1000 searches. This shows that the search gets slower and slower.
1-1000, 2 secs
1000-2000 3 secs
2000 - 3000 4 secs etc etc.
The length of the Find string does not change. I have found that if I remove the InStrB search or if the Find string is a constant, the speed does not deteriorate. The speed improves with shorter Find strings. Whether the Find string is found or not makes no difference. Also the memory usage does not increase.
Any ideas would be greatly appreciated.

Here's my code, thanks in advance:


```
' Read one search term per line and time each batch of 1000 searches.
FileNo = FreeFile
Open TESTFILE For Input As #FileNo
StartTime = Now()
i = 0
Do While Not EOF(FileNo)
    Line Input #FileNo, Find
    ' Convert the term to ANSI bytes so InStrB compares it byte for byte.
    Find = StrConv(Find, vbFromUnicode)
    Label2.Caption = InStrB(API_Merri, Find)
    i = i + 1
    If i Mod 1000 = 0 Then
        Debug.Print i & " - " & Format(Now() - StartTime, "hh:mm:ss")
        StartTime = Now()
    End If
Loop
Close #FileNo
```

----------


## Merri

If the result to find is further down in the file you search from, then it will take longer to find. In this case, if it is likely your later search keywords are down to the end of the searched file in general, then finding does take longer.

----------


## Thirks

Yup, spot on, I reversed the order of one of the files and the search started slow and got faster and faster...cheers, invaluable help.

----------


## Merri

Note that this may also mean there is room for further optimization. If the keywords are always found in the same order as they appear in the file, you don't necessarily need to search from the beginning of the file each time; instead, simply continue from the last position. Or, if it is possible, sort the keywords into such an order that they're found in order from the file.

In other news, as things keep getting faster you don't want to update Label2.Caption on each loop iteration, because interacting with controls is slow. It may seem small, but in reality a lot happens each time you change something in a control (drawing to screen, string storage etc).
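
Both points could be combined roughly as follows. This is a sketch under assumptions, not a rewrite of the posted loop: `SearchInOrder`, `Haystack` and `Keywords` are made-up names, the file contents are assumed to be already loaded as a raw byte string, the keywords are assumed to be already converted with StrConv(..., vbFromUnicode), and it only helps if the keywords really do occur in file order.

```vb
Option Explicit

' Hypothetical helper: search for each keyword, starting where the previous
' one was found, and report progress only once per 1000 keywords.
Public Sub SearchInOrder(ByRef Haystack As String, ByRef Keywords() As String)
    Dim i As Long, Pos As Long, StartAt As Long, Done As Long

    StartAt = 1
    For i = LBound(Keywords) To UBound(Keywords)
        Pos = InStrB(StartAt, Haystack, Keywords(i), vbBinaryCompare)
        ' Only move forward on a hit; a miss leaves the start position alone.
        If Pos > 0 Then StartAt = Pos

        Done = Done + 1
        ' Touch the UI or debug output rarely, not on every iteration.
        If Done Mod 1000 = 0 Then Debug.Print Done & " keywords searched"
    Next i
End Sub
```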

----------


## coolcurrent4u

Merri, I like your code and want to use it in a project, but I want a multiple-keyword search. Can I do this without looping? I have a great many keywords that I want to search for at once. Please also let me know what performance issues I should expect.

----------


## Merri

It would require more complex code than that; you can't use InStr because it always looks for a single given keyword. You'd be forced to loop through it all multiple times.

To make it more efficient and to truly loop through just once you'd need to 1) sort the keywords 2) do string matching manually against the keyword list 3) as the keyword list is sorted, it will be quite fast to know whether you've found what you're looking for; you don't need to check against all the strings, just go on until you have either a perfect match or only a partial match and the next keyword can't match. Finally 4) applying some string finding algorithm such as binary search should make things quite fast, and those require the keywords to be sorted. You'll have only a couple of lookups from the keywords list instead of going through all the keywords. That is the power of sorting & a good search algorithm.
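
The lookup part of that idea, a binary search over a sorted keyword list, could be sketched like this. It covers only step 4: `FindKeyword` is a made-up name, the array is assumed to be zero-based and already sorted with the same binary comparison, and how candidate words are cut out of the file data is left open.

```vb
Option Explicit

' Hypothetical helper: index of Candidate in the sorted Keywords array,
' or -1 if it is not there.
Public Function FindKeyword(ByRef Keywords() As String, ByRef Candidate As String) As Long
    Dim Lo As Long, Hi As Long, Middle As Long

    FindKeyword = -1
    Lo = LBound(Keywords)
    Hi = UBound(Keywords)
    Do While Lo <= Hi
        Middle = Lo + (Hi - Lo) \ 2
        Select Case StrComp(Candidate, Keywords(Middle), vbBinaryCompare)
            Case 0
                FindKeyword = Middle      ' exact match
                Exit Function
            Case Is < 0
                Hi = Middle - 1           ' candidate sorts before the middle entry
            Case Else
                Lo = Middle + 1           ' candidate sorts after the middle entry
        End Select
    Loop
End Function
```

With 10,000 sorted keywords this takes at most about 14 comparisons per candidate instead of up to 10,000.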

----------


## coolcurrent4u

I developed some code, shown in my signature (Filter array with another array). Do you think I can apply the same technique without much looping?

----------


## Aaron02

Merri

I'm getting a 'Run-time error 9 subscript out of range' as indicated in the below code:


```
Public Function ApiReadFile(ByVal strFilename As String, ByVal strStringToFind As String) As String
    Dim hFile As Long, bContent() As Byte
    Dim FileLenght As Long, Result As Long
    
    hFile = CreateFile(strFilename, GENERIC_READ, FILE_SHARE_READ Or FILE_SHARE_WRITE, ByVal 0&, OPEN_EXISTING, 0, 0)
    FileLenght = GetFileSize(hFile, 0)
    
    SetFilePointer hFile, 0, 0, FILE_BEGIN
    
    ReDim bContent(1 To FileLenght) As Byte    '<<--- Error 9
    
    ReadFile hFile, bContent(1), UBound(bContent), Result, ByVal 0&
    If Result <> UBound(bContent) Then MsgBox "Error reading file ..."
    
    CloseHandle hFile
    
    ApiReadFile = StrConv(bContent, vbUnicode)
    
    Label1.Caption = InStr(ApiReadFile, strStringToFind) 'Mid(ApiReadFile, InStrRev(ApiReadFile, vbNewLine, InStr(1, ApiReadFile, strStringToFind)), (InStr(InStr(1, ApiReadFile, strStringToFind), ApiReadFile, vbNewLine)) - InStrRev(ApiReadFile, vbNewLine, InStr(1, ApiReadFile, strStringToFind)))
    
    ReDim bContent(0) As Byte
    
End Function
```

I have not altered your project, just testing it.

----------


## Merri

What is the size of the tested file? If it is 0 bytes then the code fails as it does not check if the length is valid.

On the other hand, I can't recall whether the code created a file or not; if it does create a file, then make sure the project is located in a folder that you have write access to (Vista & 7 aren't as "nice" as XP is on file permissions).

----------


## Aaron02

I manually created the file:


```
Const TESTFILE = "C:\Test.txt"
```

It contains 4 bytes.
Same error.

The app didn't create it.

----------


## Nightwalker83

@ Merri

I noticed if I create the file "C:\Test.txt" and leave it empty then attempt to use your code I receive a message box saying:

File too big: 0.000 gigabytes. Shouldn't that be file too small?




> I manually created the file:
> 
> 
> ```
> Const TESTFILE = "C:\Test.txt"
> ```
> 
> It contains 4 bytes.
> Same error.
> ...


Did you write something in the file then save it?

----------


## Aaron02

It contains the word  Test

----------


## Merri

Aaron02: now that I had time to download the sample I noticed that the referred code is jmacp's original code and you get "subscript out of range" error if the file is not there. My code has a check for valid handle and it tells it could not find the file, so clicking the second button first should tell you this.

Nightwalker83: the bug is there but considering the nature of the sample it shouldn't matter that much: it is quite a small change to fix the problem and a rewrite is required for use in other purposes.
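
For anyone adapting the original code, the missing checks could be sketched as below. This is not the code from either sample: `TryGetFileSize` is a made-up name, and the Declare statements are written here with ByVal Long parameters so that plain 0& can be passed, which may differ from the declarations in the project. The reason the original fails is that a missing file gives an invalid handle, GetFileSize then returns -1, and `ReDim bContent(1 To -1)` (or `1 To 0` for an empty file) raises error 9.

```vb
Option Explicit

Private Declare Function CreateFile Lib "kernel32" Alias "CreateFileA" ( _
    ByVal lpFileName As String, ByVal dwDesiredAccess As Long, _
    ByVal dwShareMode As Long, ByVal lpSecurityAttributes As Long, _
    ByVal dwCreationDisposition As Long, ByVal dwFlagsAndAttributes As Long, _
    ByVal hTemplateFile As Long) As Long
Private Declare Function GetFileSize Lib "kernel32" ( _
    ByVal hFile As Long, ByVal lpFileSizeHigh As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long

Private Const GENERIC_READ As Long = &H80000000
Private Const FILE_SHARE_READ As Long = &H1
Private Const FILE_SHARE_WRITE As Long = &H2
Private Const OPEN_EXISTING As Long = 3
Private Const INVALID_HANDLE_VALUE As Long = -1

' Hypothetical helper: True only if the file opens and has a usable, non-zero size.
Public Function TryGetFileSize(ByVal FileName As String, ByRef FileLength As Long) As Boolean
    Dim hFile As Long

    hFile = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ Or FILE_SHARE_WRITE, _
                       0&, OPEN_EXISTING, 0&, 0&)
    ' A missing file or a permissions problem shows up as an invalid handle.
    If hFile = INVALID_HANDLE_VALUE Then Exit Function

    FileLength = GetFileSize(hFile, 0&)
    CloseHandle hFile

    ' Reject empty files and sizes that come back zero or negative as a signed
    ' Long; files of 4 GB or more are not handled by this sketch.
    TryGetFileSize = (FileLength > 0)
End Function
```

Calling this before the ReDim and bailing out when it returns False avoids the subscript error.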

----------


## ASC2

Hello Merri, 
Is there a NOT CASE SENSITIVE version of API_Merri? 
Thx in advance, RJ

----------


## Warlock666

Hi all

Needs to be in VB6

How do you get it to find the last matching entry? (Currently it only finds the first instance.)
i.e. to show the last GetAccountId for "30180AV", which should be 0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ

The only way I can think of is loading the text file into a string array, then reversing the string array with another loop,
then searching.

But there must be an easier way

Thanks


I have a large txt file which has a layout similar to this, plus other stuff in it:

2012-09-25 15:01:03,421 INFO - 25 September 2012 15:01:03.421 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:02:47,093 INFO - 25 September 2012 15:02:47.093 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:02:53,468 INFO - 25 September 2012 15:02:53.468 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:03:00,250 INFO - 25 September 2012 15:03:00.250 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:03:27,656 INFO - 25 September 2012 15:03:27.656 +01:00 : [0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0] POST - GetAccountId(0001768998r8Vfb4B6ZajavFqNMWZwSbm6QsfJuLwVoHcRIWx0) returning "19678JU"
2012-09-25 15:05:17,265 INFO - 25 September 2012 15:05:17.265 +01:00 : [0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8] POST - GetAccountId(0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8) returning "30180AV"
2012-09-25 15:12:50,734 INFO - 25 September 2012 15:12:50.734 +01:00 : [0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8] POST - GetAccountId(0001769003uH4h8ZWtQEur2kyZbBeaRA10W0FC8L1Da6ifaAB8) returning "30180AV"
2012-09-25 15:31:06,703 INFO - 25 September 2012 15:31:06.703 +01:00 : [0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ] POST - GetAccountId(0001769011GjRH9BkX58roI7eAmCurl6G6q7C5yYzJ6lwS21oJ) returning "30180AV"
2012-09-25 15:31:57,250 INFO - 25 September 2012 15:31:57.250 +01:00 : [0001769013Lpp8tvJXEadblzvvdyLy1CMWHQcV7fduWVjHGz4n] POST - GetAccountId(0001769013Lpp8tvJXEadblzvvdyLy1CMWHQcV7fduWVjHGz4n) returning "30180AV"
2012-09-25 15:34:35,593 INFO - 25 September 2012 15:34:35.593 +01:00 : [0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL] POST - GetAccountId(0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL) returning "91896YY"
2012-09-25 15:34:41,828 INFO - 25 September 2012 15:34:41.828 +01:00 : [0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL] POST - GetAccountId(0001769009A3cuBbG20shtUgPv8y5uBBgTbfCc8k19BedVN1dL) returning "91896YY"

----------

