CollecTor
Descriptor archives are available from CollecTor. If you need Tor’s topology at a prior point in time this is the place to go!
With CollecTor you can either read descriptors directly…
import datetime
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)

# provide yesterday's exits
exits = {}

for desc in stem.descriptor.collector.get_server_descriptors(start = yesterday):
  if desc.exit_policy.is_exiting_allowed():
    exits[desc.fingerprint] = desc

print('%i relays published an exiting policy today...\n' % len(exits))

for fingerprint, desc in exits.items():
  print(' %s (%s)' % (desc.nickname, fingerprint))
… or download the descriptors to disk and read them later.
import datetime
import stem.descriptor
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
cache_dir = '~/descriptor_cache/server_desc_today'

collector = stem.descriptor.collector.CollecTor()

for f in collector.files('server-descriptor', start = yesterday):
  f.download(cache_dir)

# then later...

for f in collector.files('server-descriptor', start = yesterday):
  for desc in f.read(cache_dir):
    if desc.exit_policy.is_exiting_allowed():
      print(' %s (%s)' % (desc.nickname, desc.fingerprint))
get_instance - Provides a singleton CollecTor used for...
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  +- get_exit_lists - TorDNSEL exit list

File - Individual file residing within CollecTor
  |- read - provides descriptors from this file
  +- download - download this file to disk

CollecTor - Downloader for descriptors from CollecTor
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  |- get_exit_lists - TorDNSEL exit list
  |
  |- index - metadata for content available from CollecTor
  +- files - files available from CollecTor
New in version 1.8.0.
stem.descriptor.collector.get_instance()
Provides the singleton CollecTor used for this module’s shorthand functions.
Returns: singleton CollecTor instance
stem.descriptor.collector.get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Shorthand for get_server_descriptors() on our singleton instance.
stem.descriptor.collector.get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Shorthand for get_extrainfo_descriptors() on our singleton instance.
stem.descriptor.collector.get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for get_microdescriptors() on our singleton instance.
stem.descriptor.collector.get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)
Shorthand for get_consensus() on our singleton instance.
stem.descriptor.collector.get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for get_key_certificates() on our singleton instance.
stem.descriptor.collector.get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for get_bandwidth_files() on our singleton instance.
stem.descriptor.collector.get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for get_exit_lists() on our singleton instance.
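The shorthand functions above all delegate to one shared instance. The following is a minimal sketch of that singleton-delegation pattern, not stem's actual source: `Archive`, `get_items` and `_SINGLETON` are hypothetical names standing in for a downloader class and its methods.

```python
class Archive:
    """Hypothetical stand-in for a downloader class such as CollecTor."""

    def __init__(self, retries=2, timeout=None):
        self.retries = retries
        self.timeout = timeout

    def get_items(self, start=None, end=None):
        # A real downloader would fetch archives here; this stub echoes its arguments.
        return [('item', start, end)]


_SINGLETON = None  # created lazily on first use


def get_instance():
    """Provide the shared Archive, creating it on the first call."""
    global _SINGLETON
    if _SINGLETON is None:
        _SINGLETON = Archive()
    return _SINGLETON


def get_items(start=None, end=None):
    """Module-level shorthand that delegates to the shared instance."""
    return get_instance().get_items(start=start, end=end)
```

With this arrangement callers that want custom retries or timeouts construct their own instance, while everyone else shares the default one through the module-level functions.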
class stem.descriptor.collector.File(path, types, size, sha256, first_published, last_published, last_modified)
Bases: object
File within CollecTor.
Variables:
- path (str) – file path within CollecTor
- types (tuple) – descriptor types contained within this file
- compression (stem.descriptor.Compression) – file compression, None if this cannot be determined
- size (int) – size of the file
- sha256 (str) – file’s sha256 checksum
- start (datetime) – first publication within the file, None if this cannot be determined
- end (datetime) – last publication within the file, None if this cannot be determined
- last_modified (datetime) – when the file was last modified
read(directory=None, descriptor_type=None, start=None, end=None, document_handler='ENTRIES', timeout=None, retries=3)
Provides descriptors from this archive. Descriptors are downloaded or read from disk as follows…
- If this file has already been downloaded through download(), these descriptors are read from disk.
- If a directory argument is provided and the file is already present, these descriptors are read from disk.
- If a directory argument is provided and the file is not present, the file is downloaded to this location then read.
- If the file has not been downloaded and no directory argument is provided, the file is downloaded to a temporary directory that’s deleted after it is read.
Parameters:
- directory (str) – destination to download into
- descriptor_type (str) – descriptor type, this is guessed if not provided
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- document_handler (stem.descriptor.__init__.DocumentHandler) – method in which to parse a NetworkStatusDocument
- timeout (int) – timeout when connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose
Returns: iterator for Descriptor instances in the file
Raises:
- ValueError if unable to determine the descriptor type
- TypeError if we cannot parse this descriptor type
- DownloadFailed if the download fails
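The four cases that read() distinguishes can be written as a small decision function. This is only a sketch of the documented order, not stem's implementation; `resolve_read_source` and its three arguments are hypothetical names introduced for illustration.

```python
def resolve_read_source(downloaded_path, directory, present_in_directory):
    """Return an (action, location) pair for the four documented cases.

    downloaded_path: path recorded by an earlier download, or None
    directory: directory supplied by the caller, or None
    present_in_directory: whether the file already exists in `directory`
    """
    if downloaded_path is not None:
        return ('read', downloaded_path)
    if directory is not None:
        if present_in_directory:
            return ('read', directory)
        return ('download_then_read', directory)
    # Nothing on disk and nowhere to keep it: use a throwaway directory.
    return ('download_to_tempdir_then_read', None)
```

Only the last case leaves nothing behind on disk, so repeated reads without a directory argument download the file each time.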
download(directory, decompress=True, timeout=None, retries=3, overwrite=False)
Downloads this file to the given location. If a file already exists this is a no-op.
Parameters:
- directory (str) – destination to download into
- decompress (bool) – decompress written file
- timeout (int) – timeout when connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose
- overwrite (bool) – if this file exists but mismatches CollecTor’s checksum then overwrites if True, otherwise raises an exception
Returns: str with the path we downloaded to
Raises:
- DownloadFailed if the download fails
- IOError if a mismatching file exists and overwrite is False
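The interplay between the no-op rule, the checksum and overwrite can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in rather than stem's code: `save_checked` is a hypothetical helper, the content is passed in instead of being fetched, and the checksum is taken to be a hex digest, which may differ from how CollecTor encodes it.

```python
import hashlib
import os


def save_checked(path, content, expected_sha256, overwrite=False):
    """Write `content` (bytes) to `path` unless a matching file is already there.

    expected_sha256 is a hex digest here; that encoding is an assumption
    of this sketch, not a statement about CollecTor's index format.
    """
    if os.path.exists(path):
        with open(path, 'rb') as existing:
            actual = hashlib.sha256(existing.read()).hexdigest()
        if actual == expected_sha256:
            return path  # we already have the right file, nothing to do
        if not overwrite:
            raise IOError('%s exists but mismatches the expected checksum' % path)
    with open(path, 'wb') as output:
        output.write(content)
    return path
```

Raising by default when a mismatching file is found avoids silently replacing data the caller may have placed there on purpose.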
class stem.descriptor.collector.CollecTor(retries=2, timeout=None)
Bases: object
Downloader for descriptors from CollecTor. The contents of CollecTor are provided in an index that’s fetched as required.
Variables:
- retries (int) – number of times to attempt the request if downloading it fails
- timeout (float) – duration before we’ll time out our request
get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Provides server descriptors published during the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- bridge (bool) – standard descriptors if False, bridge if True
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of ServerDescriptor for the given time range
Raises: DownloadFailed if the download fails
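As a usage sketch for the start, end and cache_to parameters, the example below bounds a query to the last few hours and reuses a cache directory between runs. `past_hours` and `cached_exit_fingerprints` are hypothetical helper names; the stem calls are the ones documented on this page, and running the second function needs stem installed plus network access.

```python
import datetime


def past_hours(hours, now=None):
    """Return a (start, end) pair of naive UTC datetimes covering the last `hours` hours."""
    end = now or datetime.datetime.utcnow()
    return end - datetime.timedelta(hours=hours), end


def cached_exit_fingerprints(cache_dir, hours=2):
    """Fingerprints of relays whose recently published descriptors allow exiting."""
    # Imported here so past_hours() stays usable without stem installed.
    import stem.descriptor.collector

    start, end = past_hours(hours)
    collector = stem.descriptor.collector.CollecTor()
    found = set()

    # Archives already present in cache_dir are read from disk rather than downloaded.
    for desc in collector.get_server_descriptors(start=start, end=end, cache_to=cache_dir):
        if desc.exit_policy.is_exiting_allowed():
            found.add(desc.fingerprint)

    return found
```

Passing an explicit end keeps repeated runs comparable, since the default range is otherwise open-ended.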
get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Provides extrainfo descriptors published during the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- bridge (bool) – standard descriptors if False, bridge if True
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of RelayExtraInfoDescriptor for the given time range
Raises: DownloadFailed if the download fails
get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)
Provides microdescriptors estimated to be published during the given time range, sorted oldest to newest. Unlike server/extrainfo descriptors, microdescriptors change very infrequently…
"Microdescriptors are expected to be relatively static and only change about once per week." -dir-spec section 3.3
CollecTor archives only contain microdescriptors that change, so hourly tarballs often contain very few. Microdescriptors also do not contain their publication timestamp, so this is estimated.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of Microdescriptor for the given time range
Raises: DownloadFailed if the download fails
get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)
Provides consensus router status entries published during the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- document_handler (stem.descriptor.__init__.DocumentHandler) – method in which to parse a NetworkStatusDocument
- version (int) – consensus variant to retrieve (versions 2 or 3)
- microdescriptor (bool) – provides the microdescriptor consensus if True, standard consensus otherwise
- bridge (bool) – standard descriptors if False, bridge if True
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of RouterStatusEntry for the given time range
Raises: DownloadFailed if the download fails
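One way to use the router status entries is to tally the flags they carry. This is a sketch under the assumption that each entry exposes a `flags` sequence, as stem's RouterStatusEntry does; `flag_counts` and `recent_flag_counts` are hypothetical helpers, and the second one requires stem and network access.

```python
import collections


def flag_counts(entries):
    """Tally how many router status entries carry each flag."""
    counts = collections.Counter()
    for entry in entries:
        counts.update(entry.flags)
    return counts


def recent_flag_counts(start, end=None, cache_to=None):
    """Flag tally over every consensus entry published in the given range."""
    # Imported lazily so flag_counts() can be exercised without stem or a network.
    import stem.descriptor.collector

    entries = stem.descriptor.collector.get_consensus(start=start, end=end, cache_to=cache_to)
    return flag_counts(entries)
```

A range spanning several consensuses yields one entry per relay per consensus, so these counts are per entry rather than per unique relay.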
get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)
Directory authority key certificates for the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of KeyCertificate for the given time range
Raises: DownloadFailed if the download fails
get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)
Bandwidth authority heuristics for the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of BandwidthFile for the given time range
Raises: DownloadFailed if the download fails
get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)
TorDNSEL exit lists for the given time range, sorted oldest to newest.
Parameters:
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
- cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
- timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
- retries (int) – maximum attempts to impose on a per-archive basis
Returns: iterator of TorDNSEL for the given time range
Raises: DownloadFailed if the download fails
index(compression='best')
Provides the archives available in CollecTor.
Parameters: compression (descriptor.Compression) – compression type to download from, if undefined we’ll use the best decompression available
Returns: dict with the archive contents
Raises: If unable to retrieve the index this raises…
- ValueError if json is malformed
- IOError if unable to decompress
- DownloadFailed if the download fails
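Since index() returns a plain dict, it can be walked with ordinary recursion. The sketch below assumes a nested layout in which each directory node may hold 'files' and 'directories' lists whose entries carry a 'path' key; that layout and the `iter_index_files` helper are assumptions for illustration, not something this page specifies.

```python
def iter_index_files(node, prefix=''):
    """Yield (full_path, file_entry) for every file in a nested index dict.

    Assumes directory nodes may carry 'files' and 'directories' lists whose
    entries have a 'path' key; this layout is an assumption of the sketch.
    """
    for entry in node.get('files', []):
        yield prefix + entry['path'], entry
    for sub in node.get('directories', []):
        yield from iter_index_files(sub, prefix + sub['path'] + '/')
```

If the assumption holds, something like `for path, entry in iter_index_files(collector.index()): ...` would list every archive path, though in practice files() is the more convenient entry point.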
files(descriptor_type=None, start=None, end=None)
Provides files CollecTor presently has, sorted oldest to newest.
Parameters:
- descriptor_type (str) – descriptor type or prefix to retrieve
- start (datetime.datetime) – publication time to begin with
- end (datetime.datetime) – publication time to end with
Returns: list of File
Raises: If unable to retrieve the index this raises…
- ValueError if json is malformed
- IOError if unable to decompress
- DownloadFailed if the download fails
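The filtering that files() describes (a type prefix plus a publication window) can be sketched over simple records. `Entry` and `select_files` are hypothetical names, and the handling of files whose start or end is unknown is an assumption of this sketch rather than documented behaviour.

```python
import collections
import datetime

# Hypothetical record carrying only the fields the filter needs.
Entry = collections.namedtuple('Entry', ['path', 'types', 'start', 'end'])


def select_files(entries, descriptor_type=None, start=None, end=None):
    """Keep entries that may overlap [start, end] and match the type prefix, oldest first."""
    matches = []
    for entry in entries:
        if start and (entry.end is None or entry.end < start):
            continue  # everything in this file predates the range
        if end and (entry.start is None or entry.start > end):
            continue  # everything in this file postdates the range
        if descriptor_type is None or any(t.startswith(descriptor_type) for t in entry.types):
            matches.append(entry)
    # Files with an unknown start sort first; this ordering rule is an assumption.
    return sorted(matches, key=lambda e: e.start or datetime.datetime.min)
```

Matching on a prefix is what lets a short name such as 'server-descriptor' select entries whose type string also carries a version suffix.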