How I Counted Over a Million NCBI PubMed Central Records
I read about drug resistance in cancer and wanted to know more about it. I visited the Cancer Drugs database (https://www.anticancerfund.org/en/cancerdrugs-db), which lists drugs for which resistance is documented. The site also lets you download the data, so I downloaded it and saved it as a text file.
Now to view the file. I simply open it with Notepad. It looks like a tab-separated file.
I first read the file using the standard library in Python. The “with” statement opens the text file as a Python object and closes it again when the block ends. The “readlines” method reads the individual lines of the text and puts them in a Python container called a list. Lastly, we use a “for” loop to read back the lines.
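The steps above can be sketched roughly as follows. The file name, the column names and the two rows are placeholders I made up so that the snippet runs anywhere; point `sample_path` at the real download instead.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Return every line of a text file as a list of strings."""
    # "with" opens the file and closes it again when the block ends.
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()


# Stand-in file with made-up rows (not the real database contents).
sample_path = Path(tempfile.gettempdir()) / "cancerdrugs_sample.txt"
sample_path.write_text(
    "Product\tTargets\ndrug_a\ttarget_1\ndrug_b\ttarget_2\n",
    encoding="utf-8",
)

lines = read_lines(sample_path)

# Read the lines back one by one; the tab characters make this hard to scan.
for line in lines:
    print(line)
```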
But the output in the terminal of VS Code, my code editor, is not very legible.
So I am now trying the Pandas library in Python. The name Pandas comes from “panel data”, and the library was created to handle tabular data. I will also use the web-based interactive Jupyter Notebook environment. In a Jupyter Notebook, I can code and immediately view the output (https://hub.packtpub.com/10-reasons-data-scientists-love-jupyter-notebooks/). Both Pandas and the Jupyter Notebook are available as part of the Anaconda distribution. Anaconda runs OK on my system, which has a 2-core Intel Core i3 6006U CPU and 12 GB RAM. VS Code itself has Jupyter Notebook plugins; however, I find that beginners are better off with an online Jupyter Notebook platform like Google Colab (https://medium.com/@siddesh.001/top-5-online-free-notebook-ipynb-and-other-cloud-services-dbf9580d99e3). I personally use the Jovian platform (https://www.jovian.ai/). There we can run Jupyter notebooks in different environments like Binder, Google Colab and Kaggle, and there is also great community support.
Once the notebook is active, I will import all the required libraries. Just type or paste your code and press Shift+Enter to execute it. Just ensure that the cell type is set to ‘Code’. These libraries are bundles of code which perform specific tasks.
The requests library helps us communicate with websites via HTTP (Hypertext Transfer Protocol). It is the mailman. It takes an address and a set of instructions from us, travels along the HTTP highway and hands them to the warehouse (the server, in computer speak). If we give a GET request, it collects the data that the address specifies from the warehouse, based on the instructions (the payload, in computer speak). If we give a POST request, it stores data in that warehouse.
The pandas library has a whole host of methods (code to do a specific task) to view, analyze and modify tabular data. Pandas can read data in various formats like CSV, JSON, HTML, XML and plain text. You can store a single column as a ‘Series’ object and a multi-column table as a ‘DataFrame’ object. You can then do much of what you would do in Microsoft Excel as far as data manipulation is concerned, like making pivot tables and selecting sub-sets of data. Since Pandas is a Python library, it works easily with other Python libraries like NumPy, the numerical computing library, and re, the regular expression library. You can also use Python functions like map, filter and reduce (the last one lives in the functools module) to manipulate the data with simple one-liner ‘lambda functions’.
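To make the Series and DataFrame idea concrete, here is a tiny sketch with toy values. The drug names and counts are invented for illustration and have nothing to do with the real database.

```python
import pandas as pd

# A single column is a Series (toy values, invented for this example).
articles = pd.Series([5, 12, 40], name="articles")

# Several columns side by side form a DataFrame.
table = pd.DataFrame({"drug": ["drug_a", "drug_b", "drug_c"], "articles": articles})

# A lambda passed to map() is applied to every value of a column.
table["well_studied"] = table["articles"].map(lambda n: n > 10)

# Boolean indexing selects a sub-set of rows, much like a filter in Excel.
well_studied = table[table["well_studied"]]
print(well_studied)
```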
We were trying to view the tabular data in the terminal of VS Code without much clarity. Let us try here with the Pandas library.
Oh... The interpreter threw an error. It shows a ParserError, so I simply copy the error text and head to Stack Overflow (https://stackoverflow.com/).
There lies the solution. I needed to specify the delimiter. With pd.read_csv(..), I am calling the read_csv function of Pandas, and I pass the name of the text data file, sep and header as arguments. I also store the output in a variable ‘cancer_drugs’ using the = symbol, so that I can use the contents later.
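A minimal sketch of that call is below. I use an in-memory stand-in for the downloaded file so the snippet is self-contained; the column names and rows are placeholders, and `header=0` is my assumption that the first row holds the column names.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: tab-separated text with a header row.
# Column names and rows are placeholders, not the real database contents.
raw_text = (
    "Product\tTargets\tYear\n"
    "drug_a\ttarget_1\t2001\n"
    "drug_b\ttarget_2\t2015\n"
)

# sep="\t" says the columns are split by tabs;
# header=0 says the first row holds the column names.
cancer_drugs = pd.read_csv(io.StringIO(raw_text), sep="\t", header=0)
print(cancer_drugs.head())
```

With the real file, the first argument would simply be the path of the saved text file.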
There is the nicely formatted table! Well, I don’t need to view all the columns, so first I list all the available columns.
Now I choose the columns I need.
Now I have the table in the format I want, and I want to find all the records in PubMed related to the drugs in this table. Counting the entries shows that there are 284 drugs. So, how do I find how many records there are in PubMed for each drug? Hmm…
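Listing the columns, keeping a few of them and counting the drugs could look something like this. The toy table and its column names (`Product`, `Targets`, `Year`) are assumptions for illustration; the real file has its own columns.

```python
import pandas as pd

# Toy stand-in for the cancer_drugs table (placeholder names and values).
cancer_drugs = pd.DataFrame(
    {
        "Product": ["drug_a", "drug_b", "drug_c"],
        "Targets": ["target_1", "target_2", "target_3"],
        "Year": [2001, 2015, 2019],
    }
)

# See which columns are available.
print(cancer_drugs.columns.tolist())

# Keep only the columns of interest (a list of names inside the brackets).
selected = cancer_drugs[["Product", "Targets"]]

# Count the distinct drugs in the table.
n_drugs = selected["Product"].nunique()
print(n_drugs)
```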
To access the PubMed records programmatically we need to use the Entrez Programming Utilities (E-utilities) (https://www.ncbi.nlm.nih.gov/books/NBK25497/). The E-utilities are a set of nine server-side programs that provide a stable interface into the Entrez query and database system at the National Center for Biotechnology Information (NCBI). Of these utilities, we are interested in ESearch (text searches) (eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi). ESearch is an API (Application Programming Interface), which means it lets the commands we send via requests talk to the PubMed data storage and retrieval facility according to a set of rules. In our Python script (a script is a set of instructions), a call to requests needs 2 things: 1) the API ‘end point’, or where exactly to search, and 2) the ‘payload’, or what to do. The payload carries the drug name, the database to search, and the format and amount of information to send back. Also, since we will be sending a lot of requests, it helps to identify ourselves with an API key: NCBI allows up to 3 requests per second without a key and up to 10 per second with one. To create an API key, create or log in to your NCBI account (https://www.ncbi.nlm.nih.gov/myncbi/) and go to the account settings. Your API key is shown there; simply copy and paste it into your script.
In response to our query, ESearch provides us with (returns, in computer speak) a list of PMIDs for all the research articles linked to that drug in the PubMed database. A PMID is the unique identifier number used in PubMed for each article, so the size of the list of PMIDs for each drug tells us the number of articles available for it.
OK. So first let me get the list of drugs from the cancer_drugs table.
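Pulling the drug names out of the table could be sketched like this. The column name `Product` and the second and third names are placeholders; only Abemaciclib comes from the walk-through itself.

```python
import pandas as pd

# Toy stand-in for the cancer_drugs table; "Product" is an assumed column name.
cancer_drugs = pd.DataFrame({"Product": ["Abemaciclib", "drug_b", "drug_b", None]})

# Drop empty cells, remove duplicates, and turn the column into a plain list.
drug_names = cancer_drugs["Product"].dropna().unique().tolist()
print(drug_names)
```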
Let us take the first drug, Abemaciclib, and search for it on the NCBI website.
We see that there are 135 articles in PubMed, while there are 1,732 articles in PubMed Central (PMC). Since there are more articles in PMC, we shall attempt to access PMC programmatically with our Python script.
Next, we find that there are 1,732 articles in PubMed Central if the query searches throughout the text and 58 articles if the query searches only within the abstract. The AND operator makes the search look for articles that contain both words, Abemaciclib and resistance. So we list out what ESearch needs to function:
1) Base URL : https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
2) db: the database. The default is PubMed, but we will search PubMed Central, “pmc”
3) term: the query term, “Abemaciclib AND resistance”
4) there are optional parameters for using the History Server for iterative searching, but we do not need them now
5) retstart: index of the first record to return, for starting partway through the results; we don’t need it now
6) rettype: ‘uilist’ (the default) returns the standard XML output with the list of IDs; ‘count’ returns only the count
7) retmode: format of the output; XML is the default
8) sort: sort the output, for example by publication date or by journal; again not useful now, as we only want the total numbers
9) field: the search field. This is a useful parameter, as it limits the search to one of the fields that the database offers.
To make our search more specific we shall use the field “abstract”. This will limit the search to those articles where the query is present in the abstract, and should highlight articles where resistance to Abemaciclib is the focus.
10) idtype: type of identifier- not useful for now
11) optional parameters for searching on basis of dates, not needed now
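The list above can be turned into a payload roughly like this. I build the URL with the standard library here just to show what the final address looks like without contacting NCBI; the `field` and `api_key` entries are left commented out as options.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The payload: what we want ESearch to do.
payload = {
    "db": "pmc",                           # search PubMed Central
    "term": "Abemaciclib AND resistance",  # the query
    "retmode": "xml",                      # ask for XML output
    # "field": "abstract",                 # optional: search only the abstract
    # "api_key": "YOUR_API_KEY",           # optional: placeholder, use your own key
}

# urlencode() turns the dictionary into the query string after the "?".
full_url = f"{BASE_URL}?{urlencode(payload)}"
print(full_url)
```

When the same dictionary is passed to requests as `params`, the library does this encoding for us.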
OK. So now we will start our journey to write the script to find the articles we want. One good thing about using the Jupyter Notebook is that we can type our code in the input cell and immediately see the outcome in the output cell.
Now let us call the GET method and see the output. requests.get(…) returns a response object in which several results are bundled together, and we can use different attributes to unpack and see these results.
The first thing to note is the status code. A status code of 200 indicates that the GET request was handled successfully by the server. Then comes the header information: the server sends metadata about the content it is sending back; for example, we can see that the content type of this message is XML. The URL shows exactly what the HTTP request looked like, and the content shows the output in text format. Here too we can see in the first line that the format is XML. What we are interested in is the <Count> tag on the 17th line. We can see that there are 1732 entries for the keyword search Abemaciclib AND resistance, and this is the same number as we saw in the browser above. Eventually, we will try to extract this information.
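A hedged sketch of the GET call and of unpacking the response follows. The function names `run_esearch` and `describe` are mine, and the `http_get` argument exists only so the function can be tried offline with a stand-in; by default it is the real `requests.get`.

```python
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def run_esearch(term, http_get=requests.get):
    """Send one ESearch query to PMC and return the response object."""
    payload = {"db": "pmc", "term": term, "retmode": "xml"}
    return http_get(ESEARCH_URL, params=payload, timeout=30)


def describe(response):
    """Collect the parts of a response discussed above into one dictionary."""
    text_lines = response.text.splitlines()
    return {
        "status_code": response.status_code,                       # 200 means success
        "content_type": response.headers.get("Content-Type"),      # metadata from the server
        "url": response.url,                                       # the exact request that was sent
        "first_line": text_lines[0] if text_lines else "",         # start of the XML body
    }


# Usage (needs an internet connection):
# response = run_esearch("Abemaciclib AND resistance")
# print(describe(response))
```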
We have seen that the result comes back in XML format. This format is handy for transporting and storing data (https://www.w3schools.com/xml/xml_whatis.asp). Essentially, each piece of information is stored within specific tags. For example, all names in an XML document might be stored in tags like <name> Thor </name>, so I only need to search for the name tags to get the names. We need special programs called parsers to interpret such data. In Python, the ElementTree XML library (xml.etree.ElementTree in the standard library) can interpret XML data efficiently.
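Here is a small sketch of parsing with ElementTree. The XML string is a trimmed, hand-written imitation of an ESearch reply: the count of 1732 is the figure quoted above, while the two IDs are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written imitation of an ESearch reply (IDs are placeholders).
sample_xml = """<eSearchResult>
  <Count>1732</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>0000001</Id>
    <Id>0000002</Id>
  </IdList>
</eSearchResult>"""

# Parse the text into a tree of elements.
root = ET.fromstring(sample_xml)

# Look up the first <Count> tag directly under the root and convert it to a number.
count = int(root.findtext("Count"))

# Collect the text of every <Id> tag anywhere in the tree.
ids = [node.text for node in root.iter("Id")]

print(count, ids)
```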
Thus, thanks to XML and the ElementTree library, with 3 lines of code we can go through 1741 lines of text and find the number of articles we need. Now we need this information for the other 283 drugs. For this, we will take the list of drugs and find this information one by one in a ‘for’ loop, and of course we will package all these instructions into a function.
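Packaged as a function, the search for one drug might look like the sketch below. The name `count_pmc_articles` and the `http_get` argument (there only so the function can be tried without a network connection) are my own; `rettype=count` asks ESearch to return just the count.

```python
import xml.etree.ElementTree as ET

import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_pmc_articles(drug, api_key=None, http_get=requests.get):
    """Return how many PMC records match '<drug> AND resistance'."""
    payload = {
        "db": "pmc",
        "term": f"{drug} AND resistance",
        "rettype": "count",  # only the <Count> tag is needed
        "retmode": "xml",
    }
    if api_key:
        payload["api_key"] = api_key

    response = http_get(ESEARCH_URL, params=payload, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error

    root = ET.fromstring(response.content)
    return int(root.findtext("Count"))


# Usage (needs an internet connection):
# print(count_pmc_articles("Abemaciclib"))
```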
Our function gives the desired output. Now we can call this function to process all the drugs in the list, and we shall create another function for this.
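The second function could be sketched as a plain loop that collects one row per drug. The names `count_all_drugs` and `count_one` are hypothetical, and the default pause of 0.34 seconds is my assumption for staying under 3 requests per second when no API key is used.

```python
import time

import pandas as pd


def count_all_drugs(drugs, count_one, pause_seconds=0.34):
    """Call count_one(drug) for every drug and return the results as a table."""
    rows = []
    for drug in drugs:
        rows.append({"drug": drug, "articles": count_one(drug)})
        time.sleep(pause_seconds)  # be polite to the server between requests
    return pd.DataFrame(rows, columns=["drug", "articles"])


# Offline demonstration with a stand-in counter that returns the name length.
demo = count_all_drugs(["drug_a", "drug_bb"], count_one=len, pause_seconds=0)
print(demo)
```

In the real run, `count_one` would be the function that queries ESearch for a single drug.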
So finally, we get the desired output for all the 284 drugs from the database for which resistance has been reported, along with the number of articles in PubMed Central corresponding to each. Now that we have the desired output, let us see the top 20 drugs for which records were retrieved, but first let us save the output as a CSV file for future use.
Well! How many articles did we count in all?
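Saving the table, picking the top 20 and adding up the total can be sketched as below. The three rows are toy numbers I made up, and the output file name is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy results table (made-up counts, not the real output).
results = pd.DataFrame(
    {"drug": ["drug_a", "drug_b", "drug_c"], "articles": [120, 4500, 37]}
)

# Save for future use; index=False leaves out the row numbers.
out_path = Path(tempfile.gettempdir()) / "pmc_resistance_counts.csv"
results.to_csv(out_path, index=False)

# Sort from most to fewest articles and keep at most 20 rows.
top_20 = results.sort_values("articles", ascending=False).head(20)
print(top_20)

# Add up the whole column to get the total number of articles counted.
total_articles = int(results["articles"].sum())
print(total_articles)
```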
We counted and listed over 1.5 million articles in around 2 minutes. Now let us visualize the distribution of the drugs with the highest numbers of resistance records.
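One possible way to draw such a chart is a horizontal bar plot with matplotlib; the choice of library is my assumption, and the data are toy values again.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # draw off-screen; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy results table (made-up counts, not the real output).
results = pd.DataFrame(
    {"drug": ["drug_a", "drug_b", "drug_c"], "articles": [120, 4500, 37]}
)
top = results.sort_values("articles", ascending=False).head(20)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(top["drug"], top["articles"])  # one horizontal bar per drug
ax.invert_yaxis()                      # largest bar at the top
ax.set_xlabel("Number of PMC records")
ax.set_title("Drugs with the most records mentioning resistance")
fig.tight_layout()

chart_path = Path(tempfile.gettempdir()) / "pmc_resistance_top.png"
fig.savefig(chart_path)
```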
Finally we have the visual representation of the drugs and the number of records in PMC associated with resistance to each drug. There are other ways to process this data, for example using Bash shell scripts, but that is another story.
Acknowledgement :
I acknowledge the original sources. I have tried my best to provide links to the original articles.
References: