# Extract Text from Pdfs using iTextSharp (02-03/2005)

## stanav

Hello all,
I was recently working on a PDF manipulation project. One of the things I needed to do was extract the text from PDF files and search it for a specific phrase. I was using iTextSharp to manipulate the PDFs. While iTextSharp includes a PdfReader class, it can't extract text out of the box. I did some Googling, and all I could find was this project by Zollor: http://www.codeproject.com/useritems/PDFToText.asp. Unfortunately, his code can't extract text from the PDFs created by our company (PdfBox can, but using it requires another library reference, adds another 16MB to the final footprint of my project, and is very slow), so I just went ahead and wrote my own function...
And here it is. To use it, you'll have to add a reference to itextsharp.dll to your project and import iTextSharp.text.pdf.

VB Code:
Imports iTextSharp.text.pdf

    ' Extracts the text of pages fromPageNum..toPageNum (1-based) from a PDF file.
    ' Passing 0 for either argument means "first page" / "last page" respectively.
    Public Function ParsePdfText(ByVal sourcePDF As String, _
                                 Optional ByVal fromPageNum As Integer = 0, _
                                 Optional ByVal toPageNum As Integer = 0) As String
        Dim sb As New System.Text.StringBuilder()
        Dim reader As PdfReader = Nothing
        Try
            reader = New PdfReader(sourcePDF)
            Dim pageBytes() As Byte = Nothing
            Dim token As PRTokeniser = Nothing
            Dim tknType As Integer = -1
            Dim tknValue As String = String.Empty

            If fromPageNum = 0 Then
                fromPageNum = 1
            End If
            If toPageNum = 0 Then
                toPageNum = reader.NumberOfPages
            End If

            If fromPageNum > toPageNum Then
                Throw New ApplicationException("Parameter error: The value of fromPageNum " & _
                                               "cannot be larger than the value of toPageNum")
            End If

            For i As Integer = fromPageNum To toPageNum
                pageBytes = reader.GetPageContent(i)
                If pageBytes IsNot Nothing Then
                    token = New PRTokeniser(pageBytes)
                    While token.NextToken()
                        tknType = token.TokenType
                        tknValue = token.StringValue
                        If tknType = PRTokeniser.TK_STRING Then
                            sb.Append(tknValue)
                        'These additional tests are needed to add whitespace to the output string properly
                        ElseIf tknType = PRTokeniser.TK_NUMBER AndAlso tknValue = "-600" Then
                            sb.Append(" ")
                        ElseIf tknType = PRTokeniser.TK_OTHER AndAlso tknValue = "TJ" Then
                            sb.Append(" ")
                        End If
                    End While
                End If
            Next i
        Catch ex As Exception
            'MessageBox requires a reference to System.Windows.Forms
            MessageBox.Show("Exception occurred. " & ex.Message)
            Return String.Empty
        Finally
            'Release the file handle whether or not extraction succeeded
            If reader IsNot Nothing Then reader.Close()
        End Try
        Return sb.ToString()
    End Function
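
Since the original goal was to search the extracted text for a specific phrase, here is a minimal sketch of that step. It assumes the input is the string returned by ParsePdfText; the module name PdfTextSearch and the helpers NormalizeWhitespace and ContainsPhrase are made up for illustration and are not part of iTextSharp. Because the extractor can emit extra spaces, the sketch collapses whitespace before comparing.

```vb
Imports System
Imports System.Text.RegularExpressions

Module PdfTextSearch
    ' Collapses every run of whitespace into a single space and trims the ends.
    Function NormalizeWhitespace(ByVal text As String) As String
        If text Is Nothing Then Return String.Empty
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    ' Returns True if phrase occurs in text, ignoring case and spacing differences.
    Function ContainsPhrase(ByVal text As String, ByVal phrase As String) As Boolean
        Dim haystack As String = NormalizeWhitespace(text)
        Dim needle As String = NormalizeWhitespace(phrase)
        If needle.Length = 0 Then Return False
        Return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0
    End Function

    Sub Main()
        ' Stand-in for the string returned by ParsePdfText (sample data, not real output).
        Dim extracted As String = "Invoice   Number  12345 Total  due"
        Console.WriteLine(ContainsPhrase(extracted, "invoice number"))  ' prints True
        Console.WriteLine(ContainsPhrase(extracted, "purchase order"))  ' prints False
    End Sub
End Module
```

In a real project you would pass the result of ParsePdfText(yourFile) instead of the sample string. Note that a phrase split in the middle of a word by the inserted spaces would still be missed by this simple comparison.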

----------


## gtilles

Thanks,
Good timing
Exactly what I was looking for....
I'd like to figure out a way to do diffs on two PDFs; converting them to text first seems like it might be a viable solution.

----------


## danasegarane

Nice one, stanav.
Why don't you add one more, to extract images from a PDF? :)

----------


## Dipal

Hi,

I extract data from PDF files using ASP.NET 2005,
but there is one PDF file I can't extract data from.
This PDF file is read-only (you can't copy data from it),
so I think that's why I can't extract the data.

If anyone has any ideas, please help me.

Thanks.

----------


## stanav

> Hi,
> 
> I extract data from PDF files using ASP.NET 2005,
> but there is one PDF file I can't extract data from.
> This PDF file is read-only (you can't copy data from it),
> so I think that's why I can't extract the data.
> 
> If anyone has any ideas, please help me.
> 
> Thanks.


PDF files can be created in many different ways, and whether you can extract text from a PDF depends on how it was created. For example, if a PDF is made by scanning a document, each page is an image, and you cannot extract the text using iTextSharp or any other PDF library. In that case, you will need some kind of OCR software to do it.
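
As a rough way to spot such files, you can check whether extraction produced any visible characters at all. The sketch below assumes its input is the string returned by ParsePdfText from the first post; the module name ScannedPdfCheck and the helper HasNoExtractableText are hypothetical names for illustration. It is only a heuristic: ParsePdfText also returns an empty string when an exception occurs, and a scanned PDF that already carries an OCR text layer will not be flagged.

```vb
Imports System

Module ScannedPdfCheck
    ' Returns True when the extracted text has no visible characters at all.
    ' ParsePdfText appends a space for each TJ operator, so trim before testing.
    Function HasNoExtractableText(ByVal extractedText As String) As Boolean
        Return extractedText Is Nothing OrElse extractedText.Trim().Length = 0
    End Function

    Sub Main()
        ' Sample strings standing in for ParsePdfText results.
        Console.WriteLine(HasNoExtractableText("   "))        ' prints True
        Console.WriteLine(HasNoExtractableText("Some text"))  ' prints False
    End Sub
End Module
```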

----------


## Dipal

Thanks, stanav.

----------


## kadsat

Hi, I want to extract the "Tags" from a "Tagged" PDF using C# or VB.Net. How can I do this with iTextSharp or any other open-source PDF application?

Thanks,
KadSat

----------


## jivangoyal

Hi, is it possible to search for a word in a PDF and get the font used for that word in the file?

----------

