Installing textract for Python 3
I came across textract, a library for extracting text from various file formats, and wanted to use it to extract text from HTML files. Here is what I did to get it installed on my machine.
The installation documentation outlines the steps you need to perform. It can be found at the following URL:
http://textract.readthedocs.org/en/latest/installation.html
I wanted to install it for Python 3.4 on Ubuntu 14.04, but it seemed to support only Python 2.x. Here is what I did to get it to install for Python 3.4:
Set up your virtualenv first:
virtualenv -p /usr/bin/python3.4 /usr/local
- install the required system libraries for Linux as outlined on the installation page.
- download the source file for textract from
https://pypi.python.org/pypi/textract
- untar the downloaded file
- cd into the directory and look for cases of the Python 2 exception syntax:
except ShellError, e:
and change each one to the form Python 3 requires:
except ShellError as e:
- edit the requirements/python file and comment out
pdfminer==20140328
- install the Python 3 equivalent
pip install pdfminer3k
- finally run
python3.4 setup.py install
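The two manual source edits in the steps above (rewriting the old except syntax and commenting out the pdfminer requirement) could also be scripted and run before that final install command. What follows is only a rough sketch of how I might automate it, not something shipped with textract: the function names fix_except_syntax, comment_out_requirement and patch_tree are my own, and the script assumes the unpacked source keeps its requirements in requirements/python as described above.

```python
import re
from pathlib import Path

# Matches Python 2 handlers such as "except ShellError, e:" and captures
# the leading "except ", the exception type, and the bound name.
# A plain tuple handler like "except (A, B):" has no trailing ", name"
# and is therefore left untouched.
OLD_EXCEPT = re.compile(
    r"^([ \t]*except[ \t]+)(\([^)]*\)|[\w.]+)[ \t]*,[ \t]*(\w+)[ \t]*:",
    re.MULTILINE,
)


def fix_except_syntax(source: str) -> str:
    """Rewrite 'except X, e:' as 'except X as e:' throughout source."""
    return OLD_EXCEPT.sub(r"\1\2 as \3:", source)


def comment_out_requirement(text: str, package: str) -> str:
    """Prefix the pinned line for `package` with '# '.

    Only lines whose name before '==' equals `package` exactly are
    changed, so an already commented line is not commented twice.
    """
    lines = []
    for line in text.splitlines(keepends=True):
        if line.split("==")[0].strip() == package:
            line = "# " + line
        lines.append(line)
    return "".join(lines)


def patch_tree(root: Path) -> None:
    """Apply both edits to an unpacked source directory (hypothetical helper)."""
    for path in root.rglob("*.py"):
        original = path.read_text(encoding="utf-8")
        fixed = fix_except_syntax(original)
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")

    # Assumed location of the requirements file, as described in the steps.
    requirements = root / "requirements" / "python"
    if requirements.exists():
        patched = comment_out_requirement(
            requirements.read_text(encoding="utf-8"), "pdfminer"
        )
        requirements.write_text(patched, encoding="utf-8")
```

From inside the unpacked directory, calling patch_tree(Path(".")) would apply both changes in place. It is worth looking over the changed files afterwards, since a regular expression can miss unusual formatting.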
Everything should install at this point.
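A quick way to confirm the install worked is to check that the interpreter can find the module. This is just a small sketch using the standard library; the helper name is_installed is my own and nothing here is specific to textract.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("textract"):
    print("textract is importable")
else:
    print("textract was not found in this environment")
```

Run it with the same python3.4 interpreter that was used for setup.py, otherwise it checks a different environment.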