Our evolving document understanding technology is based on open-source software and curated datasets that are proprietary, Creative Commons licensed, or in the public domain. Language models pre-trained on web-scale data (Common Crawl, WebText) are fine-tuned on legal, financial, regulatory and client domain documents to achieve high task accuracy.
Open-Source Software
Artificial intelligence and natural language processing applications became commercially practical only in the past decade, with advances in deep learning and the availability of the data and compute resources that deep learning demands.
The following are machine learning, big data, NLP, and deep learning development libraries. We work mostly with Python- and Java-based libraries.
- Scikit-Learn (https://scikit-learn.org)
- Apache Spark (https://spark.apache.org)
- spaCy (https://spacy.io)
- Hugging Face (https://huggingface.co)
- TensorFlow (https://www.tensorflow.org)
- Keras (https://keras.io)
Datasets
Large open-source, Creative Commons, or public-domain datasets are key to the training and development of AI/ML/NLP applications.
- Wikidata (https://www.wikidata.org)
- Common Crawl (https://commoncrawl.org)
- OpenWebText2, an open reproduction of WebText (https://openwebtext2.readthedocs.io)
- Google AI Natural Questions (https://ai.google.com/research/NaturalQuestions)
- Europarl (https://www.statmt.org/europarl)
- Universal Dependencies (https://universaldependencies.org)
- WordNet (https://wordnet.princeton.edu)
- Project Gutenberg (https://www.gutenberg.org)
- CFPB Consumer Complaint Database (https://www.consumerfinance.gov/data-research/consumer-complaints)
- FCC CGB – Consumer Complaints Data (https://catalog.data.gov/is/dataset/cgb-consumer-complaints-data)
- FERC Data Sources (https://www.ferc.gov/industries-data/resources/data-sources)
- USPTO Bulk Data Storage System (https://bulkdata.uspto.gov)
Regulations
Regulations are issued by governments, industry bodies, or standards bodies. Listed below are public-domain regulatory sources in finance, telecommunications, and energy from the United States, the United Kingdom, and international bodies.
- U.S. SEC Financial Statement and Notes Data Sets (https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html)
- U.S. SEC EDGAR (https://www.sec.gov/os/accessing-edgar-data)
- U.S. FCC Rules & Regulations (https://www.fcc.gov/wireless/bureau-divisions/technologies-systems-and-innovation-division/rules-regulations-title-47)
- U.S. FERC Federal Statutes (https://www.ferc.gov/enforcement-legal/legal)
- Basel international regulatory framework for banks (https://www.bis.org/bcbs/basel3.htm)
- U.S. Federal Reserve Basel Regulatory Framework (https://www.federalreserve.gov/supervisionreg/basel/basel-default.htm)
- U.S. Code of Federal Regulations (https://www.govinfo.gov/app/collection/cfr)
- UK Finance data and research (https://www.ukfinance.org.uk/data-and-research)
Explainability, bias and societal issues
We carefully monitor and evaluate issues of explainability, bias, and fairness, along with the future of work and other societal consequences that the development and use of artificial intelligence pose.
- MIT Work of the Future (https://workofthefuture.mit.edu)
- Stanford Human-Centered AI (https://hai.stanford.edu)
- Oxford Future of Humanity Institute (https://www.fhi.ox.ac.uk)
- U.S. National Artificial Intelligence Initiative (https://www.ai.gov)
- European Commission (https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence)
Developing artificial intelligence and natural language processing applications requires teams with business experience, AI/ML/NLP skills, and familiarity with the available resources and the issues they raise. Andinum is a partner with the expertise necessary to take advantage of the opportunity.
To stay informed about our technology and use cases of AI for digital transformation, compliance and regulatory technology, subscribe to our Technology mailing list.