API#
model_catalogs Python API#
Top-level API#
- setup(): Setup reference catalogs for models.
- find_availability(): Find availability for Catalog or Source.
- select_date_range(): For NOAA OFS unaggregated models: Update urlpath locations in Source.
Paths available#
Set up for using package.
- model_catalogs.FILE_PATH_AGG_FILE_LOCS(model, model_source, date, is_fore)[source]#
Return filename for aggregated file locations.
Date included to day.
- model_catalogs.FILE_PATH_BOUNDARIES(model)[source]#
Return filename for model boundaries information.
- model_catalogs.FILE_PATH_CATREFS(model, model_source)[source]#
Return filename for model/model_source start time.
- model_catalogs.FILE_PATH_END(model, model_source)[source]#
Return filename for model/model_source end time.
model_catalogs#
Everything dealing with the catalogs.
- model_catalogs.model_catalogs.find_availability(cat_or_source, model_source=None, override=False, verbose=False)[source]#
Find availability for Catalog or Source.
The code will check for previously-calculated availability. If found, the “freshness” of the information is checked as compared with mc.FRESH parameters specified in __init__.
Start and end datetimes are allowed to be calculated separately to save time.
Note that for unaggregated models with forecasts (currently just model_source “coops-forecast-noagg”), this checks availability for the latest forecast, which goes forward in time from today. It is not possible to use this function to check for the case of a forecast forward in time from a past day.
- Parameters:
cat_or_source (Intake catalog or source) – Catalog containing model_source sources for which to find availability, or single Source for which to find availability.
model_source (str, list of strings, optional) – Specified model_source(s) for which to find the availability for a catalog. If unspecified and cat_or_source is a Catalog, loop over all sources in catalog and find availability for all.
override (boolean, optional) – Use override=True to find availability regardless of freshness.
verbose (boolean, optional) – If True, start_datetime and end_datetime found for each Source will be printed.
- Returns:
If a Catalog was input, a Catalog will be returned; if a Source was input, a Source will be returned. For the single input Source or all Sources in the input Catalog, start_datetime and end_datetime are added to metadata.
- Return type:
Intake catalog or source
Examples
Set up source catalog, then find availability for all sources of CIOFS model:
>>> main_cat = mc.setup()
>>> cat = mc.find_availability(main_cat['CIOFS'], model_source=['coops-forecast-agg', 'coops-forecast-noagg'])
Find availability for only model_source “coops-forecast-noagg” of CBOFS model, and print it:
>>> source = mc.find_availability(main_cat['CBOFS']['coops-forecast-noagg'], verbose=True)
coops-forecast-noagg: 2022-08-22 13:00:00 to 2022-09-25 12:00:00
- model_catalogs.model_catalogs.find_availability_source(source, override=False)[source]#
Find availability for source specifically.
This function is called by find_availability() for each source. If source.status is False, input source is returned with None for start_datetime and end_datetime.
- Parameters:
source (Intake source) – Source for which to find availability.
- Returns:
start_datetime and end_datetime are added to metadata of source.
- Return type:
Intake source
- model_catalogs.model_catalogs.find_datetimes(source, find_start_datetime, find_end_datetime, override=False)[source]#
Find the start and/or end datetimes for source.
For sources with static urlpaths, this opens the Dataset and checks the first time for start_datetime and the last time for end_datetime.
Some NOAA OFS sources require aggregation: model_source names “coops-forecast-noagg” and “ncei-archive-noagg”. For these, the available years and months of the thredds server subcatalogs are found with find_catrefs(). start_datetime is found by further evaluating to make sure that files in the subcatalogs are both available on the page and that the days represented by model output files are consecutive (there are missing dates). end_datetime is found from the most recent subcatalog files since there aren’t missing files and dates on the recent end of the time ranges.
Uses cf-xarray to determine the time axis.
- Parameters:
source (Intake source) – Model source for which to find start and/or end datetimes
find_start_datetime (bool) – True to calculate start_datetime, otherwise returns None
find_end_datetime (bool) – True to calculate end_datetime, otherwise returns None
override (boolean, optional) – Use override=True to find catrefs regardless of freshness. This is passed in from find_availability(), so it has the same value as input there.
- Returns:
Contains ‘start_datetime’ and ‘end_datetime’ where each are strings or can be None if they didn’t need to be found.
- Return type:
tuple
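The consecutive-day check described above for start_datetime can be illustrated with a small standalone sketch. This is not the library's implementation: the helper name `first_consecutive_start` is hypothetical, and it assumes "consecutive" means an unbroken run of calendar days ending at the most recent available day. The real function also verifies that the files are actually available on the server page, which is not modeled here.

```python
from datetime import date, timedelta


def first_consecutive_start(days):
    """Return the earliest day from which every later day is present.

    `days` is an iterable of datetime.date objects, possibly with gaps.
    Hypothetical helper for illustration only.
    """
    ordered = sorted(set(days))
    if not ordered:
        return None
    start = ordered[-1]
    # Walk backward from the most recent day until a gap appears.
    for earlier in reversed(ordered[:-1]):
        if start - earlier != timedelta(days=1):
            break
        start = earlier
    return start


# July 3 and 4 are missing, so the unbroken run starts on July 5.
available = [
    date(2022, 7, 1),
    date(2022, 7, 2),
    date(2022, 7, 5),
    date(2022, 7, 6),
    date(2022, 7, 7),
]
start = first_consecutive_start(available)
print(start)  # 2022-07-05
```

Walking backward from the most recent day reflects the statement that gaps occur toward the old end of the record rather than the recent end.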
- model_catalogs.model_catalogs.make_catalog(cats, full_cat_name, full_cat_description, full_cat_metadata, cat_driver, cat_path=None, save_catalog=True, return_cat: bool = True)[source]#
Construct single catalog from multiple catalogs or sources.
- Parameters:
cats (list) – List of Intake catalog or source objects that will be combined into a single catalog.
full_cat_name (str) – Name of overall catalog.
full_cat_description (str) – Description of overall catalog.
full_cat_metadata (dict) – Dictionary of metadata for overall catalog.
cat_driver (str or Intake object or list) –
Driver to apply to all catalog entries. For example:
intake.catalog.local.YAMLFileCatalog
’opendap’
If list, must be same length as cats and contains drivers that correspond to cats.
cat_path (Path object, optional) – Path with catalog name to use for saving catalog. With or without yaml suffix. If not provided, will use full_cat_name.
save_catalog (bool, optional) – Defaults to True, and saves to cat_path.
return_cat (bool, optional) – Return catalog.
- Returns:
A single catalog made from multiple catalogs or sources.
- Return type:
Intake Catalog
Examples
Make catalog:
>>> make_catalog([list of Intake sources or catalogs], 'catalog name', 'catalog desc', {}, 'opendap', save_catalog=False)
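The cat_driver parameter accepts either one driver for every entry or a list matching cats in length. A minimal sketch of how such an argument could be normalized is shown here; `normalize_drivers` is a hypothetical helper written for illustration, not part of model_catalogs, and plain strings stand in for Intake objects.

```python
def normalize_drivers(cats, cat_driver):
    """Return one driver per entry in `cats`.

    A single driver is repeated for every entry; a list must already
    have one driver per entry. Hypothetical helper for illustration.
    """
    if isinstance(cat_driver, list):
        if len(cat_driver) != len(cats):
            raise ValueError("cat_driver list must be the same length as cats")
        return list(cat_driver)
    return [cat_driver] * len(cats)


# Strings stand in for Intake sources or catalogs here.
cats = ["source_a", "source_b", "source_c"]
same_for_all = normalize_drivers(cats, "opendap")
per_entry = normalize_drivers(cats, ["opendap", "opendap", "netcdf"])
print(same_for_all)  # ['opendap', 'opendap', 'opendap']
```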
- model_catalogs.model_catalogs.open_catalog(cat_loc, return_cat=True, save_catalog=False, override=False, boundaries=False, save_boundaries=False)[source]#
Open an intake catalog file and set up code to apply processing/transform.
Optionally calculate the boundaries of the model represented in cat_loc.
Note that saved boundaries files will be saved under the name inside the catalog, not the name of the file if you input a catalog path.
- Parameters:
cat_loc (str, Catalog) – The catalog to open. cat_loc can be the representation of a path to a catalog file (string or Path) or it can be a Catalog object.
return_cat (bool, optional) – Return catalog from function. Defaults to True.
save_catalog (bool, optional) – If True, save the catalog to mc.CACHE_PATH_COMPILED(model). Defaults to False.
override (boolean, optional) – Use override=True to calculate boundaries of the model regardless of whether the file already exists.
boundaries (boolean, optional) – If True, find previously-saved or calculate domain boundary of model.
save_boundaries (bool, optional) – If True, save the boundaries to mc.FILE_PATH_BOUNDARIES(model). Defaults to False.
- model_catalogs.model_catalogs.select_date_range(cat_or_source, start_date, end_date=None, model_source=None, use_forecast_files=None, override=False)[source]#
For NOAA OFS unaggregated models: Update urlpath locations in Source.
For other models, set up so that start_date and end_date are used to filter resulting Dataset in time. For all models, save start_date and end_date in the Source metadata.
NOAA OFS model sources that require aggregation (currently for model_sources “coops-forecast-noagg” and “ncei-archive-noagg”) need to have the specific file paths found for each file that will be read in. This function does that, based on the desired date range, and returns a Source with file locations in the urlpath. This function can also be used with any model that does not require this (because the model paths are either static or deterministic) but in those cases it does not need to be used; they will have the start and end dates applied to filter the resulting model output after to_dask() is called.
- Parameters:
cat_or_source (Intake catalog or source) – Catalog containing model_source sources, or single Source.
start_date (datetime-interpretable str or pd.Timestamp) – Date (and possibly time) of start to desired model date range. If input date does not include a time, times will be included from the start of the day. If a time is input in start_date, it is used to narrow the time range of the results.
end_date (datetime-interpretable str, pd.Timestamp, or None; optional) –
Date (and possibly time) of end of desired model date range. If input date does not include a time, times will be included through the end of the day. If a time is input in end_date, it is used to narrow the time range of the results. end_date can be None, which indicates the user wants all available model output after start_date; this option is not available for unaggregated historical NOAA OFS models which do not contain forecast files (i.e., model_source “ncei-archive-noagg”).
There are several use cases to specify:
If start_date == end_date, the full day of model output from the date is selected. If the date specified is today and all times for today are not yet available, output from forecast files will be used to fill out the day after the nowcast files end.
If end_date is None, all available model output will be retrieved starting at start_date. This option doesn’t work for archival unaggregated NOAA OFS models currently.
If end_date is in the future, use_forecast_files is set to True and the forecast is read in, but stopped at end_date.
User can set use_forecast_files=True with an end_date in the past to get old forecast model results for end_date for unaggregated NOAA OFS models. This case is probably not well-used and is not regularly tested. The results from using this combination of inputs do not align with the results of mc.find_availability() since the forecast is not the latest.
model_source (str, optional) – Which model_source to use. If mc.find_availability() has been run, the code will determine which model_source in the Catalog to use based on start_date and end_date. Otherwise a single model_source can be provided, or find_availability() will be run if needed. An exception is if there is only one model_source available for cat, that one will be used without specifying it.
use_forecast_files (bool or None, optional) – This parameter is typically set by the code and is not used by the user. However, in one use case the user can input use_forecast_files=True: when they want to read in a forecast from the past for a NOAA OFS model. Otherwise do not use this parameter directly.
override (boolean, optional) – Use override=True to find catrefs regardless of freshness.
- Returns:
Intake Source associated with the catalog entry which now contains source.metadata[‘start_date’] and source.metadata[‘end_date’]. The values of source.metadata[‘start/end_date’] will not necessarily be the same as the input start_date and end_date, but may be changed to return the desired output time range. For unaggregated NOAA OFS models, the returned Source will have updated source.urlpath to reflect the newly-found file paths of the selected date range.
- Return type:
Intake Source
Examples
Find model ‘LMHOFS’ urlpaths for all of today through all available forecast, directly from source catalog without first searching for availability with mc.find_availability():
>>> main_cat = mc.setup()
>>> today = pd.Timestamp.today()
>>> source = mc.select_date_range(main_cat["LMHOFS"]["coops-forecast-noagg"], start_date=today, end_date=None)
Find urlpaths with select_date_range and have it run find_availability():
>>> source = mc.select_date_range(main_cat['LMHOFS'], start_date=today, end_date=today)
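The use cases listed for end_date can be summarized as a small decision function. The sketch below only labels which documented case a pair of inputs falls into; it does not call model_catalogs, and the name `classify_date_range` and the order in which the cases are checked are assumptions made for illustration.

```python
from datetime import datetime


def classify_date_range(start_date, end_date, now):
    """Label which documented use case the inputs fall into.

    Hypothetical helper; the real function also updates urlpaths and
    metadata, which is not modeled here.
    """
    if end_date is None:
        return "all available output from start_date"
    if end_date > now:
        return "forecast read in, stopped at end_date"
    if start_date == end_date:
        return "full day of output for that date"
    return "output between start_date and end_date"


now = datetime(2022, 9, 1, 12)
open_ended = classify_date_range(datetime(2022, 9, 1), None, now)
into_future = classify_date_range(datetime(2022, 9, 1), datetime(2022, 9, 3), now)
single_day = classify_date_range(datetime(2022, 8, 1), datetime(2022, 8, 1), now)
past_range = classify_date_range(datetime(2022, 8, 1), datetime(2022, 8, 5), now)
print(open_ended, into_future, single_day, past_range, sep="\n")
```

The less common case of requesting an old forecast with use_forecast_files=True is deliberately left out, since it depends on an explicit user flag rather than on the dates alone.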
- model_catalogs.model_catalogs.setup(locs='mc_', override=False, boundaries=True)[source]#
Setup reference catalogs for models.
Loops over catalogs that have been previously installed as data packages to intake that start with the string(s) in locs. The default is to read in the required GOODS model catalogs which are prefixed with “mc_”. Alternatively, one or more local catalog files can be input as strings or Paths.
This function calls open_catalog which reads in previously-saved model boundary information (or calculates it if not available) and saves temporary catalog files for each model (called “compiled”), then this function links those together into the returned main catalog. For some models, reading in the original catalogs applies a “today” and/or “yesterday” date Intake user parameter that supplies two example model files that can be used for examining the model output for the example times. Those are rerun each time this function is rerun, filling the parameters using the proper dates.
Note that saved compiled catalog files will be saved under the name inside the catalog, not the name of the file if you input a catalog path.
- Parameters:
locs (str, Path, list) –
This can be:
a string or Path describing where a Catalog file is located
a string of the prefix for selecting catalogs from the default intake catalog, intake.cat. It is expected to be of the form “PREFIX_CATALOGNAME” with an underscore at the end followed by the catalog name, and there could be many catalogs with that “PREFIX_” set up.
a list of a combination of the previous options.
override (boolean, optional) – Use override=True to compile the catalog files together regardless of freshness.
boundaries (boolean, optional) – If True, find previously-saved or calculate domain boundary of model. Intended for testing.
- Returns:
Nested Intake catalog with a catalog for each input option. Each model in turn has one or more model_source available (e.g., “coops-forecast-agg”, “coops-forecast-noagg”).
- Return type:
Intake catalog
Examples
Set up main catalog:
>>> main_cat = mc.setup()
Examine list of models available in catalog:
>>> list(main_cat)
Examine the model_sources for a specific model in the catalog:
>>> list(main_cat['CBOFS'])
Separate from model_catalogs you can check the default Intake catalog with:
>>> list(intake.cat)
- model_catalogs.model_catalogs.transform_source(source_orig)[source]#
Set up transform of original catalog source
- Parameters:
source_orig (Intake source) – Original source, which will be transformed
- Returns:
source_transform, the transformed version of source_orig. This source will point at the source of source_orig as the target.
- Return type:
Intake source
utils#
Utilities to help with catalogs.
- model_catalogs.utils.agg_for_date(date, strings, filetype, is_forecast=False, pattern=None)[source]#
Select NOAA OFS-style nowcast/forecast files for aggregation.
This function finds the files whose path includes the given date, regardless of times which might change the date forward or backward.
- Parameters:
date (str of datetime, pd.Timestamp) – Date of day to find model output files for. Doesn’t pay attention to hours/minutes/seconds.
strings (list) – List of strings to be filtered. Expected to be file locations from a thredds catalog.
filetype (str) – Which filetype to use. Every NOAA OFS model has “fields” available, but some have “regulargrid” or “2ds” also. This availability information is in the catalog metadata for the model under filetypes metadata.
is_forecast (bool, optional) – If True, then date is the last day of the time period being sought and the forecast files should be brought in along with the nowcast files, to get the model output the length of the forecast out in time. The forecast files brought in will have the latest timing cycle of the day that is available. If False, all nowcast files (for all timing cycles) are brought in.
pattern (str, optional) – If a model file pattern doesn’t match that assumed in this code, input one that will work. Currently only NYOFS doesn’t match but the pattern is built into the catalog file.
- Returns:
Contains URLs for where to find all of the model output files that match the keyword arguments. List is not sorted correctly for times (this happens later).
- Return type:
List
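The selection rule described for is_forecast can be sketched without any server access. The helper below is a hypothetical illustration, not the library's code: it assumes file names follow the `nos.<model>.<filetype>.nNNN.<YYYYMMDD>.tHHz.nc` shape seen in the file2dt example, with `n` marking nowcast files and `f` marking forecast files. The sample paths are made up.

```python
import re
from datetime import date


def select_files_for_date(day, strings, filetype, is_forecast=False):
    """Pick nowcast files for `day`, plus the latest forecast cycle if asked.

    Hypothetical helper; the file-name pattern is an assumption.
    """
    stamp = day.strftime("%Y%m%d")
    pattern = re.compile(
        rf"\.{re.escape(filetype)}\.([nf])\d{{3}}\.{stamp}\.t(\d{{2}})z\.nc$"
    )
    nowcast, forecast = [], {}
    for loc in strings:
        match = pattern.search(loc)
        if match is None:
            continue  # wrong date, wrong filetype, or unexpected name
        kind, cycle = match.groups()
        if kind == "n":
            nowcast.append(loc)
        else:
            forecast.setdefault(cycle, []).append(loc)
    if is_forecast and forecast:
        # Keep only the forecast files from the latest timing cycle of the day.
        return nowcast + forecast[max(forecast)]
    return nowcast


files = [
    "2022/07/nos.cbofs.fields.n001.20220701.t00z.nc",
    "2022/07/nos.cbofs.fields.n001.20220701.t06z.nc",
    "2022/07/nos.cbofs.fields.f001.20220701.t00z.nc",
    "2022/07/nos.cbofs.fields.f001.20220701.t06z.nc",
    "2022/07/nos.cbofs.regulargrid.n001.20220701.t00z.nc",
    "2022/07/nos.cbofs.fields.n001.20220702.t00z.nc",
]
nowcast_only = select_files_for_date(date(2022, 7, 1), files, "fields")
with_forecast = select_files_for_date(date(2022, 7, 1), files, "fields", is_forecast=True)
print(len(nowcast_only), len(with_forecast))  # 2 3
```

As in the documented function, the result is not sorted by time; ordering would happen in a later step.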
- model_catalogs.utils.astype(value, type_)[source]#
Return value as type type_.
Particularly made to work correctly for returning string, PosixPath, or Timestamp as list.
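The reason a dedicated helper is useful for the list case is that calling `list()` directly on a string splits it into characters, and calling it on a path raises a TypeError. The sketch below shows one way such a conversion could behave; `as_type` is a hypothetical stand-in, and `datetime` is used in place of a pandas Timestamp to keep the example free of third-party imports.

```python
from datetime import datetime
from pathlib import Path, PurePath


def as_type(value, type_):
    """Return `value` as `type_`, wrapping scalar-like values in a list.

    Hypothetical sketch, not the library's implementation.
    """
    if isinstance(value, type_):
        return value
    if type_ is list and isinstance(value, (str, PurePath, datetime)):
        # list("abc") would give ['a', 'b', 'c']; list(Path(...)) would fail.
        return [value]
    return type_(value)


wrapped_str = as_type("coops-forecast-agg", list)
wrapped_path = as_type(Path("catalog.yaml"), list)
unchanged = as_type(["already", "a", "list"], list)
print(wrapped_str)  # ['coops-forecast-agg']
```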
- model_catalogs.utils.calculate_boundaries(cats, save_files=True, return_boundaries=False)[source]#
Calculate boundary information for all models.
This loops over all input catalogs and will try with multiple model_source if necessary (in case servers aren’t working) to access the example model output files and calculate the bounding box and numerical domain boundary. The numerical domain boundary is calculated using alpha_shape with previously-chosen parameters stored in the original model catalog files. The bounding box and boundary string representation (as WKT) are then saved to files.
The files are saved the first time you run this function, so this function should only be rerun if you suspect that a model domain has changed or you have a new model catalog.
- Parameters:
cats (Catalog, list of Catalogs) – The Catalog or Catalogs for which to find boundaries.
save_files (boolean, optional) – Whether to save files or not. Defaults to True. Saves to mc.FILE_PATH_BOUNDARIES(cat_loc.name).
return_boundaries (boolean, optional) – Whether to return boundaries information from this call. Defaults to False.
Examples
Calculate boundary information for CBOFS:
>>> import model_catalogs as mc
>>> main_cat = mc.setup()
>>> mc.calculate_boundaries(main_cat["CBOFS"])
- model_catalogs.utils.file2dt(filename)[source]#
Return Timestamp of NOAA OFS filename
…without reading in the filename to xarray. See docs for details on the formula. Most NOAA OFS models have 1 timestep per file, but NYOFS has 6.
- Parameters:
filename (str) – Filename for which to decipher datetime. Can be full path or just file name.
- Returns:
pandas Timestamp of the time(s) in the file.
- Return type:
Timestamp
Examples
>>> url = 'https://www.ncei.noaa.gov/thredds/dodsC/model-cbofs-files/2022/07/nos.cbofs.fields.n001.20220701.t00z.nc'
>>> mc.utils.file2dt(url)
Timestamp('2022-06-30 19:00:00')
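The example above is consistent with a simple rule for single-timestep files: a nowcast file `nNNN` for cycle `tHHz` falls NNN hours after the point 6 hours before the cycle time. The sketch below encodes that rule using only the standard library. The rule is an assumption inferred from the one documented example, the forecast branch (cycle time plus NNN hours) is likewise assumed, and `sketch_file_time` is a hypothetical name. Models with several timesteps per file, such as NYOFS, are not covered.

```python
import re
from datetime import datetime, timedelta


def sketch_file_time(filename):
    """Estimate the model time of a NOAA OFS-style file from its name.

    Hypothetical sketch; the offsets are assumptions, not the library's formula.
    """
    match = re.search(r"\.([nf])(\d{3})\.(\d{8})\.t(\d{2})z\.nc$", filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    kind, step, day, cycle = match.groups()
    cycle_time = datetime.strptime(day + cycle, "%Y%m%d%H")
    if kind == "n":
        # Assumed: nowcast steps count forward from 6 hours before the cycle.
        return cycle_time - timedelta(hours=6) + timedelta(hours=int(step))
    # Assumed: forecast steps count forward from the cycle time itself.
    return cycle_time + timedelta(hours=int(step))


url = (
    "https://www.ncei.noaa.gov/thredds/dodsC/model-cbofs-files/"
    "2022/07/nos.cbofs.fields.n001.20220701.t00z.nc"
)
print(sketch_file_time(url))  # 2022-06-30 19:00:00
```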
- model_catalogs.utils.filedates2df(filelocs)[source]#
Set up dataframe of datetimes to filenames.
- Parameters:
filelocs (list of str) – File locations.
- Returns:
Contains the index datetimes corresponding to file locations (column ‘filenames’).
- Return type:
DataFrame
- model_catalogs.utils.find_bbox(ds, dd=None, alpha=None)[source]#
Determine bounds and boundary of model.
- Parameters:
ds (Dataset) – xarray Dataset containing model output.
dd (int, optional) – Number to decimate model output lon/lat, as a stride.
alpha (float, optional) – Number for alphashape to determine what counts as the convex hull. Larger number is more detailed, 1 is a good starting point.
- Returns:
Contains the name of the longitude and latitude variables for ds, geographic bounding box of model output ([min_lon, min_lat, max_lon, max_lat]), low res and high res wkt representation of model boundary.
- Return type:
List
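The bounding-box part of the return value, in the order [min_lon, min_lat, max_lon, max_lat], can be illustrated with plain lists. This sketch covers only the box, not the alphashape boundary, and `bounding_box` is a hypothetical helper. Whether the library applies the dd stride before computing the box is an assumption here.

```python
def bounding_box(lons, lats, dd=1):
    """Return [min_lon, min_lat, max_lon, max_lat] from coordinate lists.

    `dd` is a stride used to decimate the points first. Hypothetical helper.
    """
    lons = lons[::dd]
    lats = lats[::dd]
    return [min(lons), min(lats), max(lons), max(lats)]


# Made-up coordinates for illustration.
lons = [-77.0, -76.5, -75.9, -76.2]
lats = [36.8, 37.5, 39.6, 38.1]
bbox = bounding_box(lons, lats)
print(bbox)  # [-77.0, 36.8, -75.9, 39.6]
```

Decimating with a large stride can drop the extreme points, so a box computed this way may come out slightly smaller than the true extent.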
- model_catalogs.utils.find_catrefs(catloc)[source]#
Find hierarchy of catalog references for thredds catalog.
- Parameters:
catloc (str) – Search in thredds catalog structure from base catalog, catloc.
- Returns:
Contains tuples containing the hierarchy of directories in the thredds catalog structure to get to where the datafiles start.
- Return type:
list
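The shape of the returned value, a list of tuples giving the directory hierarchy down to where data files start, can be shown with a nested dictionary standing in for a thredds catalog. No server is contacted in this sketch, and `walk_catrefs` and the sample tree are hypothetical.

```python
def walk_catrefs(tree, prefix=()):
    """Return tuples of directory labels leading to each leaf of `tree`.

    `tree` is a nested dict where an empty dict marks a directory that
    holds data files. Hypothetical stand-in for a thredds catalog.
    """
    found = []
    for name, child in tree.items():
        path = prefix + (name,)
        if child:
            found.extend(walk_catrefs(child, path))
        else:
            found.append(path)
    return found


# Year/month layout, giving 2-label tuples.
tree = {"2022": {"07": {}, "08": {}}, "2021": {"12": {}}}
catrefs = walk_catrefs(tree)
print(catrefs)  # [('2022', '07'), ('2022', '08'), ('2021', '12')]
```

A catalog with a further day level would yield 3-label tuples from the same traversal.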
- model_catalogs.utils.find_filelocs(catref, catloc, filetype='fields')[source]#
Find thredds file locations.
- Parameters:
catref (tuple) – 2 or 3 labels describing the directories from catlog to get the data locations.
catloc (str) – Base thredds catalog location.
filetype (str) – Which filetype to use. Every NOAA OFS model has “fields” available, but some have “regulargrid” or “2ds” also (listed in separate catalogs in the model name).
- Returns:
Locations of files found from catloc to hierarchical location described by catref.
- Return type:
list
- model_catalogs.utils.get_fresh_parameter(filename, source)[source]#
Get freshness parameter, based on the filename.
A freshness parameter is stored in __init__ for required scenarios which is looked up using the logic in this function, based on the filename. The source is checked for most types of actions for an overriding freshness parameter value, otherwise the default is used.
- Parameters:
filename (Path) – Filename to determine freshness.
source (Intake Source) – Source from which to check for an overriding freshness parameter. Is not used for “compiled” catalog files.
- Returns:
mu, a pandas Timedelta-interpretable string describing the amount of time that filename should be considered fresh before needing to be recalculated.
- Return type:
str
- model_catalogs.utils.is_fresh(filename, source=None)[source]#
Check if file called filename is fresh.
If filename doesn’t exist, return False.
- Parameters:
filename (Path) – Filename to determine freshness
source (Intake Source) – Source from which to check for an overriding freshness parameter. Is not used for “compiled” catalog files.
- Returns:
True if fresh and False if not or if filename is not found.
- Return type:
Boolean
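A freshness check of this kind can be sketched by comparing a file's modification time against an allowed age. The version below is an illustration under stated assumptions: it takes a `datetime.timedelta` rather than a pandas Timedelta-interpretable string, it uses the file's modification time as the reference, and the names `is_fresh_sketch` and `availability.yaml` are hypothetical.

```python
import tempfile
import time
from datetime import timedelta
from pathlib import Path


def is_fresh_sketch(filename, max_age, now=None):
    """Return True if `filename` exists and was modified less than `max_age` ago.

    `now` is a POSIX timestamp, injectable so the check can be tested.
    """
    path = Path(filename)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    age = timedelta(seconds=now - path.stat().st_mtime)
    return age < max_age


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "availability.yaml"  # hypothetical cache file name
    cached.write_text("start_datetime: null\n")
    mtime = cached.stat().st_mtime
    fresh_now = is_fresh_sketch(cached, timedelta(hours=1), now=mtime + 60)
    stale_later = is_fresh_sketch(cached, timedelta(hours=1), now=mtime + 7200)
    missing = is_fresh_sketch(Path(tmp) / "absent.yaml", timedelta(hours=1))

print(fresh_now, stale_later, missing)  # True False False
```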
Transforming Datasets with process#
This file contains all information for transforming the Datasets.
- class model_catalogs.process.DatasetTransform(*args, **kwargs)[source]#
Bases: GenericTransform
Transform where the input and output are both Dask-compatible Datasets
This derives from GenericTransform, and you must supply transform and any transform_kwargs.
- Attributes:
- cache
- cache_dirs
- cat
- classname
- dates: Dates associated with urlpath files
- description
- dtype
- entry
- gui: Source GUI, with parameter selection and plotting
- has_been_persisted
- hvplot: alias for DataSource.plot
- is_persisted
- name
- plot: Plot API accessor
- plots: List custom associated quick-plots
- shape
- status: Status of server for source.
- target: Connect target into Transform
- urlpath: Data location for target
- version
Methods
- __call__(**kwargs): Create a new instance of this source with altered arguments
- close(): Close open resources corresponding to this data source.
- configure_new(**kwargs): Create a new instance of this source with altered arguments
- describe(): Description from the entry spec
- discover(): Open resource and populate the source attributes.
- export(path, **kwargs): Save this data for sharing with other people
- get(**kwargs): Create a new instance of this source with altered arguments
- persist([ttl]): Save data from this source to local persistent storage
- read(): Same here.
- read_chunked(): Return iterator over container fragments of data source
- read_partition(i): Return a part of the data corresponding to i-th partition.
- to_dask(): Makes it possible to read in model output.
- to_spark(): Provide an equivalent data object in Apache Spark
- update_urlpath(): Update urlpath for transform.
- yaml(): Return YAML representation of this data-source
- get_persisted
- set_cache_dir
- property cache#
- property cache_dirs#
- cat = None#
- property classname#
- close()#
Close open resources corresponding to this data source.
- configure_new(**kwargs)#
Create a new instance of this source with altered arguments
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
- container = 'xarray'#
- property dates#
Dates associated with urlpath files
…if there is more than one. Doesn’t work for static links or RTOFS models. So, this is really for NOAA OFS models.
- Returns:
Ordered dates to match urlpath locations.
- Return type:
list
- describe()#
Description from the entry spec
- description = None#
- discover()#
Open resource and populate the source attributes.
- dtype = None#
- property entry#
- export(path, **kwargs)#
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- get(**kwargs)#
Create a new instance of this source with altered arguments
Enables the picking of options and re-evaluating templates from any user-parameters associated with this source, or overriding any of the init arguments.
Returns a new data source instance. The instance will be recreated from the original entry definition in a catalog if this source was originally created from a catalog.
- get_persisted()#
- property gui#
Source GUI, with parameter selection and plotting
- property has_been_persisted#
The base class does not interact with persistence
- property hvplot#
alias for DataSource.plot
- input_container = 'xarray'#
- property is_persisted#
The base class does not interact with persistence
- name = None#
- npartitions = 0#
- on_server = False#
- optional_params = {}#
Perform an arbitrary function to transform an input
- transform: function to perform transform
function(container_object) -> output, or a fully-qualified dotted string pointing to it
- transform_params: dict
The keys are names of kwargs to pass to the transform function. Values are either concrete values to pass; or param objects which can be made into widgets (but must have a default value) - or a spec to be able to make these objects.
- allow_dask: bool (optional, default True)
Whether to_dask() is expected to work, which will in turn call the target’s to_dask()
- partition_access = False#
- persist(ttl=None, **kwargs)#
Save data from this source to local persistent storage
- Parameters:
ttl (numeric, optional) – Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
kargs – Passed to the _persist method on the base container.
- property plot#
Plot API accessor
This property exposes both predefined plots (described in the source metadata) and general-purpose plotting via the hvPlot library. Supported containers are: array, dataframe and xarray.
To display in a notebook, be sure to run intake.output_notebook() first.
The set of plots defined for this source can be found by
>>> source.plots
["plot1", "plot2"]
and to display one of these:
>>> source.plot.plot1()
<holoviews/panel output>
To create new plot types and supply custom configuration, use one of the methods of hvplot.hvPlot:
>>> source.plot.line(x="fieldX", y="fieldY")
The full set of arguments that can be passed, and the types of plot they refer to, can be found in the doc and attributes of hvplot.HoloViewsConverter.
Once you have found a suitable plot, you may wish to update the plots definitions of the source. Simply add the plotname= optional argument (this will overwrite any existing plot of that name). The source’s YAML representation will include the new plot, and it could be saved into a catalog with this new definition.
>>> source.plot.line(plotname="new", x="fieldX", y="fieldY");
>>> source.plots
["plot1", "plot2", "new"]
- property plots#
List custom associated quick-plots
- read_chunked()#
Return iterator over container fragments of data source
- read_partition(i)#
Return a part of the data corresponding to i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- required_params = ['transform', 'transform_kwargs']#
- set_cache_dir(cache_dir)#
- shape = None#
- property status#
Status of server for source.
- Returns:
If True, server was reachable.
- Return type:
bool
- property target#
Connect target into Transform
This way can expose some information to query. This will only run once per object.
- Returns:
Source that is the target of the object Transform
- Return type:
Intake Source
- to_dask()[source]#
Makes it possible to read in model output.
- Returns:
xarray Dataset that has been read in
- Return type:
Dataset
- to_spark()#
Provide an equivalent data object in Apache Spark
The mapping of python-oriented data containers to Spark ones will be imperfect, and only a small number of drivers are expected to be able to produce Spark objects. The standard arguments may be translated, unsupported or ignored, depending on the specific driver.
This method requires the package intake-spark
- update_urlpath()[source]#
Update urlpath for transform.
Run this in select_date_range for aggregated sources. This can be run more than once.
- property urlpath#
Data location for target
Can be overwritten by update_urlpath
- Returns:
Location(s) for where data can be found
- Return type:
list
- version = None#
- yaml()#
Return YAML representation of this data-source
The output may be roughly appropriate for inclusion in a YAML catalog. This is a best-effort implementation.
- model_catalogs.process.add_attributes(ds, metadata: Optional[dict] = None)[source]#
Update Dataset metadata.
Update the Dataset metadata with metadata passed in from catalog files.
- Parameters:
ds (Dataset) – xarray Dataset containing model output.
metadata (dict, optional) – Metadata that has processing information to apply
- Returns:
Improved Dataset.
- Return type:
Dataset