What is the best PHP search string in...
by srizoophari - 8 years ago (2016-05-03)
I need a library or class that searches for a string in a PDF and returns the page number on which the matched string appears.
1. by Manuel Lemos - 8 years ago (2016-05-06)
There are classes that extract text from PDF files, but I am not sure whether any of the existing ones can also return the page on which the text originally appears.
by Christian Vigh - 8 years ago (2016-05-06)
I have made a class to extract text contents from PDF files; however, it does not take care of the page number. Maybe it could be a first step?
1. by Manuel Lemos - 8 years ago (2016-05-09)
It would be better if you could count pages to also give the page number of each text block. Is that difficult?
2. by Christian Vigh - 8 years ago (2016-05-16) in reply to comment 1 by Manuel Lemos
Well, it could range anywhere from tedious to a nightmare... :-) I'm kidding; in fact, I already put that on my to-do list when posting my initial answer because, although my original concern was only extracting text, I thought it was a good idea to be able to locate text in the whole document.
I will add a "Pages" array property that will contain the text of individual pages. I will also add a GetPageOf($offset) method that will return the page number for a given byte offset in the Text property. And maybe some methods to simply find the page number(s) of some text.
I think everything should be ready by the end of this week.
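For readers following along, the offset-to-page lookup described above can be sketched roughly as follows. This is only an illustration of the idea, not the PdfToText implementation: the function name pageOfOffset, the sample page texts, and the assumption that the full text is the plain concatenation of the per-page texts are all hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of PdfToText): map a byte offset in the
 * concatenated document text back to the page that contains it.
 *
 * @param array $pages  page number => page text, in document order
 * @param int   $offset byte offset into the concatenation of all pages
 * @return int|null     page number, or null if the offset is out of range
 */
function pageOfOffset(array $pages, int $offset): ?int
{
    if ($offset < 0) {
        return null;
    }

    $start = 0; // byte offset at which the current page begins
    foreach ($pages as $pageNumber => $pageText) {
        $end = $start + strlen($pageText); // first byte after this page
        if ($offset < $end) {
            return $pageNumber;
        }
        $start = $end;
    }

    return null; // offset lies beyond the last page
}

// Sample data (assumed): three pages keyed by page number.
$pages = [
    1 => "Introduction. ",
    2 => "Methods and data. ",
    3 => "Results.",
];
$text = implode('', $pages);

// Locate a string in the full text, then ask which page it falls on.
$offset = strpos($text, 'data');
$page   = pageOfOffset($pages, $offset);
```

With the sample data, the word "data" sits on the second page, so $page ends up as 2. A linear scan is enough for a sketch; for documents with many pages, a precomputed list of page start offsets searched by bisection would avoid rescanning on every lookup.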
3. by Manuel Lemos - 8 years ago (2016-05-17) in reply to comment 2 by Christian Vigh
Great. That would make your package innovative. There are already classes to extract text from PDF, but none of them report the pages of the text objects.
4. by Christian Vigh - 8 years ago (2016-05-20) in reply to comment 3 by Manuel Lemos
Hi everybody,
I'm glad to announce that the PdfToText class is now able to retrieve the page number of any text located in a pdf document.
Seven new methods are available to retrieve this information: GetPageFromOffset, text_strpos/text_stripos, document_strpos/document_stripos, and text_match/document_match (see README.md).
There is also a Pages array property that holds the text contents of the individual pages in the document.
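As a rough illustration of what searching a per-page text array looks like, here is a minimal sketch. It deliberately does not call PdfToText, whose actual method signatures are documented in its README.md; the function findPagesOf and the sample pages are hypothetical, and the array is only assumed to be keyed by page number like the Pages property described above.

```php
<?php
/**
 * Hypothetical helper (not part of PdfToText): find every occurrence of a
 * string in an array of per-page texts.
 *
 * @param array  $pages           page number => page text
 * @param string $needle          string to search for
 * @param bool   $caseInsensitive use stripos() instead of strpos()
 * @return array                  page number => list of byte offsets on that page
 */
function findPagesOf(array $pages, string $needle, bool $caseInsensitive = false): array
{
    $matches = [];
    if ($needle === '') {
        return $matches; // an empty needle would match everywhere
    }

    foreach ($pages as $pageNumber => $pageText) {
        $offset = 0;
        while (true) {
            $pos = $caseInsensitive
                ? stripos($pageText, $needle, $offset)
                : strpos($pageText, $needle, $offset);
            if ($pos === false) {
                break; // no further match on this page
            }
            $matches[$pageNumber][] = $pos;
            $offset = $pos + strlen($needle); // continue after this match
        }
    }

    return $matches;
}

// Sample data (assumed): three pages keyed by page number.
$pages = [
    1 => "Invoice 2016-001 for ACME.",
    2 => "No reference here.",
    3 => "See invoice 2016-001; invoice paid.",
];

$hits        = findPagesOf($pages, 'invoice', true);
$pageNumbers = array_keys($hits); // pages with at least one match
```

With the sample data, the case-insensitive search finds matches on pages 1 and 3, and a case-sensitive search for the same string finds them on page 3 only.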
5. by Manuel Lemos - 8 years ago (2016-05-20) in reply to comment 4 by Christian Vigh
That is great. I have not seen a package, in PHP or any other language, that could do that.
6. by Marcelo - 6 years ago (2018-07-24) in reply to comment 5 by Manuel Lemos
Hello Christian, I have very large PDFs (200MB) and I cannot extract all the text from them. Would you have any solution for this? These PDFs also contain images, which is why they are so large. I just need the text. Your function can read the file but cannot process it. I await your suggestion.