What is the best PHP search string in pdf class?: Search string in PDF and return page number


What is the best PHP search string in pdf class? #search string in pdf


by srizoophari - 9 years ago (2016-05-03)

Search string in PDF and return page number

+5	I need a library or class that searches for a string in a PDF and returns the page number of each match.

1 Clarification request
1. by Manuel Lemos - 9 years ago (2016-05-06)
There are classes that extract text from PDF files, but I am not sure whether any of the existing ones can also return the page on which the text appears.


2 Recommendations

PHP PDF to HTML: Convert PDF to HTML using Poppler

This class can convert PDF to HTML using the Poppler tools.

It can take the path of the Poppler tools and execute several operations to extract information from PDF documents.

Currently the class can convert whole PDF documents or individual pages to HTML, get the document information, return the page count, etc.

Several parameters can be configured, such as the preferred format of the pictures inside the document, the zoom scale, and whether to use images or CSS inline within the HTML or as external files.

+1	by Anton N Nikolaev package author 215 - 8 years ago (2016-12-02) Comment I like it.
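
A minimal sketch of how Poppler output could answer the original request. It does not use the class recommended above, whose API is not shown in this thread. Instead it assumes the Poppler pdftotext command is installed and on the PATH, and relies on pdftotext ending each page with a form feed character. All function names are made up for illustration.

```php
<?php
/**
 * Split pdftotext output into pages. pdftotext ends each page with a
 * form feed ("\f"), so splitting leaves an empty trailing chunk that
 * is dropped. Returns an array keyed by 1-based page number.
 */
function splitPdftotextPages(string $output): array
{
    $chunks = explode("\f", $output);
    if (end($chunks) === '') {
        array_pop($chunks);
    }
    $pages = [];
    foreach ($chunks as $i => $text) {
        $pages[$i + 1] = $text;
    }
    return $pages;
}

/** Return the numbers of the pages whose text contains $needle (case-insensitive). */
function findPagesContaining(array $pages, string $needle): array
{
    $hits = [];
    foreach ($pages as $number => $text) {
        if (stripos($text, $needle) !== false) {
            $hits[] = $number;
        }
    }
    return $hits;
}

/** Hypothetical wrapper: runs pdftotext and searches its output page by page. */
function searchPdfWithPoppler(string $pdfPath, string $needle): array
{
    // "-" as the output file makes pdftotext write to standard output.
    $output = shell_exec('pdftotext -layout ' . escapeshellarg($pdfPath) . ' -');
    if (!is_string($output)) {
        throw new RuntimeException('pdftotext failed or produced no output');
    }
    return findPagesContaining(splitPdftotextPages($output), $needle);
}
```

A call such as searchPdfWithPoppler('manual.pdf', 'warranty') would return the list of matching page numbers. Note that a phrase broken across two lines or two pages in the extracted text would not be found by this simple substring test.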

PHP PDF to Text: Extract text contents from PDF files

This package can extract the text contents from a PDF file using pure PHP code (no external tools are needed).

It provides the following features:

- Text is extracted from PDF files as a single text property. Individual page contents are also available separately
- Text strings can be searched over the whole file contents, or through individual pages
- Support for multiple character sets: parsed text is returned in UTF-8
- Embedded images can be extracted if desired
- Several option flags are available to adjust PDF contents processing
- RTL language processing
- Basic page layout rendering
- PDF Form data extraction
- Ability to extract areas of text as well as line and column contents, using XML-based capture definitions

+3	by Christian Vigh package author 435 - 9 years ago (2016-05-06) Comment I have made a class to extract text contents from PDF files; however, it does not keep track of page numbers. Maybe it could be a first step?

7 Comments
1. by Manuel Lemos - 9 years ago (2016-05-09)
It would be better if you could count pages to also give the page number of each text block. Is that difficult?
2. by Christian Vigh package author - 9 years ago (2016-05-16) in reply to comment 1 by Manuel Lemos
Well, it could range anywhere from tedious to a nightmare... :-) I'm kidding; in fact, I had already put that on my to-do list when posting my initial answer because, although my original concern was only extracting text, I thought it would be a good idea to be able to locate text within the whole document.

I will add a "Pages" array property that will contain the text of individual pages. I will also add a GetPageOf($offset) method that will return the page number for a given byte offset in the Text property. And maybe some methods to simply find the page number(s) of some text.

I think everything should be ready by the end of this week.
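
To illustrate the offset-to-page idea described in this comment, here is a rough sketch. It is a hypothetical helper, not code from the PdfToText class. It assumes the full text is simply the page texts joined in order with a known separator, which may not match how the class actually builds its Text property.

```php
<?php
/**
 * Hypothetical helper (not the PdfToText implementation).
 * $pages maps 1-based page numbers to page text. The full text is
 * assumed to be the page texts joined in order with $separator.
 * Returns the page holding byte $offset, or null if out of range.
 * A byte inside a separator is attributed to the page before it.
 */
function pageOfOffset(array $pages, int $offset, string $separator = "\n"): ?int
{
    if ($offset < 0 || $pages === []) {
        return null;
    }
    $lastNumber = array_key_last($pages);
    $start = 0;
    foreach ($pages as $number => $text) {
        $end = $start + strlen($text);
        if ($number !== $lastNumber) {
            // No separator follows the last page.
            $end += strlen($separator);
        }
        if ($offset < $end) {
            return $number;
        }
        $start = $end;
    }
    return null;
}
```

With pages [1 => 'first page', 2 => 'second page', 3 => 'third page'] joined by "\n", the word "second" starts at byte 11 of the full text, so pageOfOffset($pages, 11) gives 2.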
3. by Manuel Lemos - 9 years ago (2016-05-17) in reply to comment 2 by Christian Vigh
Great. That would make your package innovative. There are already classes to extract text from PDF, but none of them report the pages of the text objects.
4. by Christian Vigh package author - 9 years ago (2016-05-20) in reply to comment 3 by Manuel Lemos
Hi everybody,

I'm glad to announce that the PdfToText class is now able to retrieve the page number of any text located in a PDF document.

Seven new methods are available to retrieve this information: GetPageFromOffset, text_strpos/text_stripos, document_strpos/document_stripos, and text_match/document_match (see README.md).

There is also a Pages array property that holds the text contents of the individual pages in the document.
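
As a rough illustration of what a per-page array makes possible, the sketch below runs a regular expression over an array of page texts and reports the page and in-page byte offset of every match. It does not call the PdfToText methods listed above, whose signatures are documented in the package's README.md rather than in this thread. The function name and the array layout (page number => text) are assumptions.

```php
<?php
/**
 * Hypothetical helper: find every regex match in a page-number => text
 * array. Returns a list of ['page' => int, 'offset' => int, 'text' => string]
 * entries, where offset is the byte offset within that page's text.
 */
function matchInPages(array $pages, string $pattern): array
{
    $results = [];
    foreach ($pages as $number => $text) {
        // PREG_OFFSET_CAPTURE makes each match a [string, offset] pair.
        if (preg_match_all($pattern, $text, $found, PREG_OFFSET_CAPTURE)) {
            foreach ($found[0] as [$match, $offset]) {
                $results[] = ['page' => $number, 'offset' => $offset, 'text' => $match];
            }
        }
    }
    return $results;
}
```

Given an extractor that exposes such an array, a call like matchInPages($pages, '/invoice \d+/i') would list each invoice reference together with the page it appears on.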
5. by Manuel Lemos - 9 years ago (2016-05-20) in reply to comment 4 by Christian Vigh
That is great. I have not seen a package, in PHP or any other language, that could do that.
6. by Marcelo - 6 years ago (2018-07-24) in reply to comment 5 by Manuel Lemos
Hello Christian, I have very large PDFs (200 MB) and I cannot extract all the text from them. Would you have any solution for this? These PDFs also contain images, which is why they are so large. I just need the text. Your function can read the file but cannot process it. I await your suggestion.
