Most sra-tools can locate and download data from the NCBI SRA on demand, which removes the need for a separate download step and, most importantly, downloads only the data that are required. This feature can reduce the bandwidth, storage, and time needed for tasks that use less than 100% of the data contained in a run.
sra-tools utilize VDB name resolution, enabling them to accept simple accessions as parameters instead of filesystem paths. The VDB name resolver will generate URLs into the NCBI SRA for any object not found locally, allowing the object to be opened and retrieved over https.
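The "local first, otherwise remote" behavior described above can be pictured with a small sketch. This is an illustration only, not the VDB resolver: the function name `resolve_accession` is hypothetical, and the cache directory and URL pattern are simply copied from the `srapath` output shown on this page, so they are assumptions rather than guaranteed locations.

```shell
# Illustrative stand-in for name resolution; NOT the actual VDB resolver.
# Usage: resolve_accession ACCESSION [CACHE_DIR]
resolve_accession() {
    acc=$1
    # Assumed cache location, taken from the srapath example on this page.
    cache_dir=${2:-"$HOME/ncbi/public/sra"}
    if [ -f "$cache_dir/$acc.sra" ]; then
        # Object found locally: report the filesystem path.
        printf '%s\n' "$cache_dir/$acc.sra"
    else
        # Not found locally: fall back to a remote URL.
        # The URL pattern is an assumption copied from the srapath output on this page.
        printf 'https://sra-download.ncbi.nlm.nih.gov/srapub/%s\n' "$acc"
    fi
}

resolve_accession SRR000001
```

In practice `srapath` performs this lookup for you; the sketch only makes the decision visible.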
$ prefetch SRR000001
2016-12-01T15:51:52 prefetch.2.8.0: 1) Downloading 'SRR000001'...
2016-12-01T15:51:52 prefetch.2.8.0: Downloading via http...
2016-12-01T15:52:22 prefetch.2.8.0: 1) 'SRR000001' was downloaded successfully
This demonstrates using prefetch to download a run, in this case over https. [NB - the tool still states that it is using http even though it may be using https. This is a cosmetic defect and will be fixed in the next release.] For higher throughput, Aspera downloads can be used if installed on your system.
The actual file has been downloaded to a cache area in your filesystem:
$ srapath SRR000001
/home/you/ncbi/public/sra/SRR000001.sra
The run file is compressed, occupying about 311M on disk:
$ ls -l /home/you/ncbi/public/sra/SRR000001.sra
-rw-rw-r-- 1 you you 325788509 2014-11-19 16:45 /home/you/ncbi/public/sra/SRR000001.sra
Now convert to SFF (NOTE - runs downloaded with prefetch are now located by accession):
$ sff-dump SRR000001
Read 470985 spots for SRR000001
Written 470985 spots for SRR000001
This run contains 454 data with signals. Here it is in SFF format (about 746M):
$ ls -l SRR000001.sff
-rw-rw-r-- 1 you you 782054672 2014-11-19 16:59 SRR000001.sff
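The "about 311M" and "about 746M" figures follow from the byte counts in the two listings. As a quick check, the sketch below converts bytes to mebibytes; the helper name `bytes_to_mib` is hypothetical and not part of sra-tools.

```shell
# Hypothetical helper: convert a byte count to mebibytes (1 MiB = 1,048,576 bytes).
bytes_to_mib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1048576 }'
}

bytes_to_mib 325788509   # size of SRR000001.sra, prints 310.7
bytes_to_mib 782054672   # size of SRR000001.sff, prints 745.8
```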
In this example, the run was first downloaded using prefetch and stored in the user's public cache. Next, the run was converted into SFF, passing only the simple accession as an argument, but all data were read from cache.
$ cache-mgr --report
-----------------------------------
0 cached file(s)
1 complete file(s)
325,788,509 bytes in cached files
325,788,509 bytes used in cached files
0 lock files
Here, we've checked the contents of our cache. It tells us that there are no partially cached files, 1 complete file (our SRR000001.sra from example 1), and the corresponding bytes. The file was completely downloaded by prefetch.
Let's clear the cache entirely:
$ cache-mgr --clear
-----------------------------------
1 files removed
0 directories removed
325,788,509 bytes removed
Now, we can run fastq-dump on the accession without prior download. To verify that the run will be found remotely, we can use srapath to tell us where the complete object is located:
$ srapath SRR000001
https://sra-download.ncbi.nlm.nih.gov/srapub/SRR000001
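If you want to make this local-versus-remote check in a script, one simple approach is to test whether the reported path starts with a URL scheme. The sketch below is a minimal illustration; `is_remote_path` is a hypothetical helper, and the two sample paths are the ones printed by srapath on this page.

```shell
# Hypothetical helper: succeeds (exit status 0) when the argument looks like an http(s) URL.
is_remote_path() {
    case $1 in
        http://*|https://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify the two paths reported by srapath on this page.
for p in /home/you/ncbi/public/sra/SRR000001.sra \
         https://sra-download.ncbi.nlm.nih.gov/srapub/SRR000001; do
    if is_remote_path "$p"; then
        echo "remote: $p"
    else
        echo "local: $p"
    fi
done
```

With sra-tools installed, the same test could be applied to live output, for example `is_remote_path "$(srapath SRR000001)"`.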
We see that the path is now remote. Let's convert on-the-fly:
$ fastq-dump SRR000001
Read 470985 spots for SRR000001
Written 470985 spots for SRR000001
Looking at the fastq file, we can see it is complete:
$ ls -l SRR000001.fastq
-rw-rw-r-- 1 you you 301196578 2014-11-19 17:17 SRR000001.fastq
$ wc -l SRR000001.fastq
1883940 SRR000001.fastq
$ expr 1883940 / 4
470985
Notice that the fastq is slightly smaller than the original SRA file. This is because the SRA file also carries 454 signal and clipping data, as well as inlined linker sequences, none of which are used by fastq. (This is true for all data submitted as SFF.)
Let's look again at the cache contents:
$ cache-mgr --report
-----------------------------------
1 cached file(s)
0 complete file(s)
325,788,832 bytes in cached files
121,351,760 bytes used in cached files
0 lock files
The report tells us that there is 1 partially cached file, and no complete files. This is because fastq-dump only needs read names, read sequences, and qualities. In this case, the amount of data cached is shown as 121,351,760 bytes, instead of the full 325,788,832 contained in the original SRR.
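To put the saving in proportion, the two byte counts in the report can be turned into a percentage. The sketch below is a hypothetical helper, not part of sra-tools, and it assumes the report prints each statistic on its own line with comma-grouped digits, as in the listing above.

```shell
# Hypothetical helper: read a `cache-mgr --report` listing on stdin and print
# the percentage of the cached file's bytes that were actually fetched.
cache_used_percent() {
    awk '
        /bytes in cached files/      { gsub(",", "", $1); total = $1 + 0 }
        /bytes used in cached files/ { gsub(",", "", $1); used  = $1 + 0 }
        END { if (total > 0) printf "%.1f\n", 100 * used / total }
    '
}

# Feed it the report lines shown above; prints 37.2
cache_used_percent <<'EOF'
1 cached file(s)
0 complete file(s)
325,788,832 bytes in cached files
121,351,760 bytes used in cached files
0 lock files
EOF
```

So in this example fastq-dump transferred a little over a third of the run. With sra-tools installed, the helper could be used as `cache-mgr --report | cache_used_percent`.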