Select, weight and analyze complex sample data


Sample Analytics


In large-scale surveys, complex random mechanisms are often used to select samples, and estimates derived from such samples must reflect those mechanisms. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four subpackages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.
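For intuition, systematic PPS selection can be sketched in a few lines of plain Python (an illustration of the technique, not the samplics implementation; the cluster sizes below are made up): units are laid end to end according to their measure of size, and selection points are placed at a fixed interval from a random start.

```python
import itertools
import random

def pps_systematic(sizes, n, seed=None):
    """Systematic PPS selection: place n points total/n apart from a
    random start; a unit is selected when a point falls inside its
    cumulative-size interval."""
    rng = random.Random(seed)
    total = sum(sizes)
    step = total / n                      # selection interval
    points = [rng.uniform(0, step) + k * step for k in range(n)]
    cum = list(itertools.accumulate(sizes))
    selected = []
    for p in points:
        for i, c in enumerate(cum):
            if p < c:
                selected.append(i)
                break
    return selected

# Five clusters with made-up measures of size; select 2 of them with PPS
sample = pps_systematic([120, 80, 200, 50, 150], n=2, seed=42)
```

Larger clusters occupy longer stretches of the cumulated scale, so they are proportionally more likely to contain a selection point.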

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration and normalization
  • Weight replication, i.e. bootstrap, BRR, and jackknife
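The replication idea behind these weights can be sketched in plain Python (an illustration of the delete-one jackknife, not the samplics API): each replicate removes one PSU and inflates the weights of the remaining PSUs so the replicate still represents the population.

```python
def jackknife_rep_weights(weights, psus):
    """Delete-one-PSU jackknife: replicate r zeroes out PSU r and
    multiplies the other weights by k/(k-1), k = number of PSUs."""
    unique_psus = sorted(set(psus))
    k = len(unique_psus)
    replicates = []
    for dropped in unique_psus:
        replicates.append(
            [0.0 if p == dropped else w * k / (k - 1)
             for w, p in zip(weights, psus)]
        )
    return replicates

# Two PSUs ("a" and "b") with two units each
reps = jackknife_rep_weights([10.0, 10.0, 20.0, 20.0], ["a", "a", "b", "b"])
```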

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation, i.e. bootstrap, BRR, and jackknife
  • Regression-based, e.g. generalized regression (GREG)
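As a sketch of the linearization idea (not the samplics implementation): for a ratio R = Y/X, each observation is replaced by the linearized value z_i = (y_i − R·x_i)/x̄, and the estimated variance of the mean of the z_i serves as the variance of R. Under simple random sampling without replacement from a population of size N:

```python
def taylor_ratio_variance(y, x, pop_size):
    """Ratio estimate R = sum(y)/sum(x) and its Taylor (linearization)
    variance under SRS without replacement."""
    n = len(y)
    ratio = sum(y) / sum(x)
    xbar = sum(x) / n
    # Linearized values: z_i = (y_i - R * x_i) / xbar
    z = [(yi - ratio * xi) / xbar for yi, xi in zip(y, x)]
    zbar = sum(z) / n
    s2_z = sum((zi - zbar) ** 2 for zi in z) / (n - 1)
    fpc = 1 - n / pop_size            # finite population correction
    return ratio, fpc * s2_z / n

ratio, var = taylor_ratio_variance([2.0, 3.0, 5.0], [1.0, 2.0, 2.0], pop_size=100)
```

Complex designs change the variance formula applied to the z_i (strata, PSUs, weights), but the linearization step itself stays the same.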

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable and stable domain-level estimates, SAE techniques can be used to model the variable of interest and produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.
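The area-level idea can be illustrated with a toy Fay-Herriot-style shrinkage in plain Python (illustration only; samplics estimates the model variance from the data, whereas here it is assumed known): each direct estimate is pulled toward a synthetic (regression) estimate, more strongly when its sampling variance is large.

```python
def fh_shrinkage(direct, synthetic, sigma2_v, psi):
    """EBLUP-style blend: gamma_d = sigma2_v / (sigma2_v + psi_d), so
    precise direct estimates (small psi_d) keep most of their weight."""
    estimates = []
    for y_d, s_d, psi_d in zip(direct, synthetic, psi):
        gamma = sigma2_v / (sigma2_v + psi_d)
        estimates.append(gamma * y_d + (1 - gamma) * s_d)
    return estimates

# Two areas: the second has a noisier direct estimate (psi = 3.0),
# so it is shrunk harder toward its synthetic value
est = fh_shrinkage(direct=[10.0, 20.0], synthetic=[12.0, 18.0],
                   sigma2_v=1.0, psi=[1.0, 3.0])
```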

For more details, visit https://samplics.readthedocs.io/en/latest/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision of 0.10.

import samplics
from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion")
sample_size.calculate(target=0.80, precision=0.10)
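Under the hood this is the classical Wald formula n = z²·p(1−p)/e²; a quick plain-Python check (not the samplics code itself), using z = 1.96 for 95% confidence:

```python
import math

def wald_sample_size(p, e, z=1.96, deff=1.0):
    """Wald sample size for a proportion: n = deff * z^2 * p(1-p) / e^2,
    rounded up to the next whole unit."""
    return math.ceil(deff * z * z * p * (1 - p) / (e * e))

n = wald_sample_size(p=0.80, e=0.10)  # 62
```

With an expected proportion of 0.80 and a half-width of 0.10, this gives 62 before any design-effect inflation.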

Furthermore, the population is located in four natural regions, i.e. North, South, East, and West. We could be interested in calculating sample sizes based on region-specific requirements, e.g. expected proportions, desired precisions, and associated design effects.

import samplics
from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion", method="fleiss", stratification=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size.calculate(target=expected_proportions, precision=half_ci, deff=deff)
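The same Wald computation can be applied stratum by stratum with the design effects (again a plain-Python illustration, not the library internals; the Fleiss method used above adds a continuity correction, so its numbers will differ):

```python
import math

Z = 1.96  # normal critical value for 95% confidence

expected = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1.0, "South": 1.5, "East": 2.5, "West": 2.0}

# n_h = deff_h * z^2 * p_h * (1 - p_h) / e_h^2, rounded up per stratum
sizes = {
    s: math.ceil(deff[s] * Z**2 * expected[s] * (1 - expected[s]) / half_ci[s] ** 2)
    for s in expected
}
```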

To select a sample of primary sampling units using a PPS method, we can use code similar to:

import pandas as pd

import samplics
from samplics.sampling import SampleSelection

psu_frame = pd.read_csv("psu_frame.csv")
psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
   method="pps-sys",
   stratification=True,
   with_replacement=False
   )

psu_frame["psu_prob"] = pps_design.inclusion_probs(
   psu_frame["cluster"],
   psu_sample_size,
   psu_frame["region"],
   psu_frame["number_households_census"]
   )
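For systematic PPS, the inclusion probability of a cluster is essentially n_h × size_i / (total size in stratum h), capped at 1. A simplified plain-Python sketch (not the samplics implementation; real implementations typically treat capped clusters as certainty selections and redistribute the excess):

```python
def pps_inclusion_probs(sizes_by_stratum, n_by_stratum):
    """pi_i = n_h * size_i / total stratum size, capped at 1."""
    probs = {}
    for stratum, sizes in sizes_by_stratum.items():
        total = sum(sizes.values())
        n_h = n_by_stratum[stratum]
        for cluster, size in sizes.items():
            probs[cluster] = min(1.0, n_h * size / total)
    return probs

# Made-up cluster sizes for one stratum; select 2 clusters
probs = pps_inclusion_probs(
    sizes_by_stratum={"East": {"c1": 100, "c2": 300, "c3": 100}},
    n_by_stratum={"East": 2},
)
```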

To adjust the design sample weight for nonresponse, we can use code similar to:

import samplics
from samplics.weighting import SampleWeight

status_mapping = {
   "in": "ineligible",
   "rr": "respondent",
   "nr": "non-respondent",
   "uk": "unknown"
   }

full_sample["nr_weight"] = SampleWeight().adjust(
   samp_weight=full_sample["design_weight"],
   adjust_class=full_sample["region"],
   resp_status=full_sample["response_status"],
   resp_dict=status_mapping
   )
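Conceptually, the adjustment redistributes the weight of eligible nonrespondents to respondents within each adjustment class. A simplified plain-Python sketch (not the samplics code; here units of unknown eligibility are simply treated as nonrespondents):

```python
def nonresponse_adjust(weights, classes, statuses):
    """Within each class, inflate respondent weights by
    (eligible weight) / (respondent weight); nonrespondents and
    unknowns get weight 0, ineligible units keep their weight."""
    factors = {}
    for c in set(classes):
        in_class = [i for i, cl in enumerate(classes) if cl == c]
        resp = sum(weights[i] for i in in_class if statuses[i] == "respondent")
        elig = sum(weights[i] for i in in_class if statuses[i] != "ineligible")
        factors[c] = elig / resp
    adjusted = []
    for w, c, s in zip(weights, classes, statuses):
        if s == "respondent":
            adjusted.append(w * factors[c])
        elif s == "ineligible":
            adjusted.append(w)
        else:
            adjusted.append(0.0)
    return adjusted

wgt = nonresponse_adjust(
    weights=[10.0, 10.0, 10.0, 10.0],
    classes=["North"] * 4,
    statuses=["respondent", "respondent", "non-respondent", "ineligible"],
)
```

The class totals are preserved: the nonrespondent's weight moves to the two respondents, while the ineligible unit is left untouched.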

To estimate population parameters, we can use code similar to:

import samplics
from samplics.estimation import TaylorEstimator, ReplicateEstimator

# Taylor-based
zinc_mean_str = TaylorEstimator("mean").estimate(
   y=nhanes2f["zinc"],
   samp_weight=nhanes2f["finalwgt"],
   stratum=nhanes2f["stratid"],
   psu=nhanes2f["psuid"],
   remove_nan=True
)

# Replicate-based
ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
   y=nhanes2brr["weight"],
   samp_weight=nhanes2brr["finalwgt"],
   x=nhanes2brr["height"],
   rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
   remove_nan=True
)
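All replicate methods share the same variance recipe, up to scaling constants: re-estimate the statistic with each set of replicate weights and average the squared deviations from the full-sample estimate. A plain-Python illustration for a weighted ratio (toy data, not the NHANES example above):

```python
def weighted_ratio(y, x, w):
    """Ratio of weighted totals, sum(w*y) / sum(w*x)."""
    return (sum(wi * yi for wi, yi in zip(w, y))
            / sum(wi * xi for wi, xi in zip(w, x)))

def brr_variance(y, x, full_w, rep_ws):
    """BRR variance: mean squared deviation of the replicate
    estimates from the full-sample estimate."""
    theta = weighted_ratio(y, x, full_w)
    rep_estimates = [weighted_ratio(y, x, w) for w in rep_ws]
    return sum((r - theta) ** 2 for r in rep_estimates) / len(rep_estimates)

y = [2.0, 4.0, 6.0, 8.0]
x = [1.0, 2.0, 2.0, 4.0]
full_w = [1.0, 1.0, 1.0, 1.0]
rep_ws = [[2.0, 0.0, 2.0, 0.0], [0.0, 2.0, 0.0, 2.0]]  # two half-samples
var = brr_variance(y, x, full_w, rep_ws)
```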

To predict small area parameters, we can use code similar to:

import samplics
from samplics.sae import EblupAreaModel, EblupUnitModel

# Area-level basic method
fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
   yhat=yhat, X=X, area=area, intercept=False, error_std=sigma_e, tol=1e-4,
)
fh_model_reml.predict(X=X, area=area, intercept=False)

# Unit-level basic method
eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(ys, Xs, areas)
eblup_bhf_reml.predict(Xmean, areas_list)

Installation

pip install samplics

Python 3.6.1 or newer is required, and the main dependencies are numpy, pandas, scipy, and statsmodels.

License

MIT

Contact

Created by Mamadou S. Diallo. Feel free to contact me!
