02/02/2023, West Palm Beach — It was the last Friday of the year, December 30, 2022, when a CIO friend in Austria emailed me an article from the Swiss newspaper Neue Zürcher Zeitung about ChatGPT and its business predictions for the new year. “What do you think about ChatGPT?” he asked. “Is it good for business?”

ChatGPT is the latest large language model from OpenAI. Having followed OpenAI and the GPT (Generative Pre-trained Transformer) models since 2018 for natural language processing applications, I decided it was time to try out ChatGPT, which had been released just the month before, on November 30, 2022. 

A conversation with ChatGPT

My plan of action was to test ChatGPT in four specific areas: (1) its ability to answer factual questions, the type whose answers can readily be found on Wikipedia; (2) its ability to answer a basic legal business question, in this case compliance with existing data privacy laws; (3) ChatGPT’s self-knowledge about its model and training data; and (4) ChatGPT’s multi-lingual understanding and generation capabilities. While testing these capabilities, I was particularly interested in observing, on the one hand, ChatGPT’s alignment toward providing true, factual answers and, on the other, any tendency to “hallucinate,” a major known issue of generative AI.

As it turned out, in the course of the conversation that ensued with ChatGPT, we also tested ChatGPT’s dialogue style and capabilities, as well as another feature of the application: (5) ChatGPT’s response to requests for advice and questions about emotions and sentience.

Our rating of ChatGPT’s abilities in the first four categories above is informal and on a five-letter grading scale: A, B, C, D, and F, with optional plus and minus marks. The full transcript of the ChatGPT session can be found here.

Results and assessment

Our test of ChatGPT with factual questions centered on two figures of the late 19th century: Friedrich Nietzsche, the German philosopher, and Lou Andreas-Salomé, a Russian-German writer and psychoanalyst. In both cases ChatGPT correctly understood the questions and, except in the case of one yes/no question, answered them successfully, with nuances that appear to go beyond what the Wikipedia articles present (see transcript).


Factual questions Rating (*): A-

 

In our second test, probing the usefulness of the model in the legal domain, we asked ChatGPT about the European General Data Protection Regulation (GDPR), and pushed the conversation beyond this, to OpenAI’s compliance with GDPR and ChatGPT’s detection and handling of private information in the course of conversations with it. 

We ask, “Does OpenAI comply with GDPR?”

ChatGPT responds that “As an artificial intelligence, I do not have the ability to collect, store, or process personal data, so the EU General Data Protection Regulation (GDPR) does not apply to me directly.” The premise of the response is obviously false and wishful thinking.

When pressed on the point, ChatGPT contradicts itself: “I want to reassure you that any personal information shared with me will not be collected, stored, or used by me or by the company that developed me, OpenAI.” The responses contradict OpenAI’s Privacy Policy: “1. Personal information we collect – We collect information that alone or in combination with other information in our possession could be used to identify you (‘Personal Information’) …”

We give ChatGPT a passing grade on the GDPR and legal questions. Our reason is that the GDPR response itself is satisfactory and, on the question of OpenAI’s compliance with GDPR, the answer is plausible, at least to someone not well versed in the technology, the regulation, and the company’s privacy policy. The response is probably also far less outrageous than what a determined attorney working on behalf of a client might claim to a complainant, particularly outside a court of law.

Legal questions Rating: C  

 

In our third test, we ask ChatGPT about knowledge of its model and training data: “What version of GPT are you, ChatGPT? How are you different from GPT-3?”

The model obfuscates, offering a generic, noncommittal response:

ChatGPT: “I am not a specific version of the Generative Pre-trained Transformer (GPT) model,” and “As a language model, I have been trained on a large dataset of text from a variety of sources … It is difficult to determine the exact size of the dataset used to train me.”

We do not expect the service to reveal its inner workings or code, but a language model, or a service based on one, should provide its model version and a basic description of its training data, as a modicum of transparency and reproducibility, both in its responses to users, auditors, and developers and in its model card.
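
For illustration, a minimal model-card disclosure of the kind argued for here might contain only a few fields; the field names and values below are hypothetical, not OpenAI’s actual documentation.

```python
# Hypothetical minimal model-card disclosure; the field names and
# values are illustrative assumptions, not OpenAI's documentation.
model_card = {
    "model_name": "ChatGPT",
    "model_version": "GPT-3.5 series (sibling model to InstructGPT)",
    "training_data": "Text from the web, books, and human-written dialogues",
    "training_data_cutoff": "2021",
    "intended_uses": ["conversational assistance", "text generation"],
    "known_limitations": ["hallucination", "outdated knowledge after cutoff"],
}
```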

The ChatGPT model version and training data are, in any case, not wholly confidential and can be approximated from public information on ChatGPT’s predecessors, GPT-3 and InstructGPT (admittedly, the latter postdates ChatGPT’s 2021 training data cutoff).

Our lower rating on self-knowledge reflects the unhelpful and needlessly obfuscated answers; technically, this is perhaps the easiest shortcoming to fix.

Self-knowledge and model card Rating: D

 

In our fourth test, we ask ChatGPT about its multi-lingual capabilities: “In what human languages can you chat?”

ChatGPT responds, “Some of the languages in which I am able to understand and generate text include English, Spanish, French, German, Italian, Dutch, Russian, Portuguese, and many others.” 

To evaluate this multi-lingual capability, we put to ChatGPT the same first question, in German, from the Neue Zürcher Zeitung (NZZ) article that my CIO friend had sent: “Many people praise your intelligence. May we test it on economic matters?” (**)

ChatGPT’s answer, in German, is flawless: “I am an artificial intelligence and have no personal experience or knowledge in economic matters.” 

Although similar, this is not the same answer that ChatGPT provided in the NZZ interview. This illustrates an important feature of ChatGPT and large language models (LLMs): under common application settings that allow for creativity and variety in responses, LLMs do not always give the same answer to identical inputs. LLMs can in fact “invent” answers from unexpected segments of their training data, or produce diametrically opposed, contradictory responses to the same input question.
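
This variability is governed by sampling settings such as temperature. ChatGPT itself had no public API at the time of writing, so the minimal sketch below uses the OpenAI Python library with the predecessor text-davinci-003 model; the prompt and parameter values are illustrative assumptions.

```python
# Minimal sketch of sampling settings and response variability,
# using the OpenAI Python library (pip install openai). ChatGPT had
# no public API at the time of writing, so text-davinci-003 stands in.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "Many people praise your intelligence. May we test it on economic matters?"

# temperature=0.0 makes sampling (nearly) deterministic; higher values
# sample more freely, so identical prompts can yield different answers.
for temperature in (0.0, 1.0):
    for run in (1, 2):
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=100,
            temperature=temperature,
        )
        print(f"T={temperature}, run {run}: {response.choices[0].text.strip()}")
```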

As is typical of ChatGPT responses, after training, model alignment with reinforcement learning from human feedback (RLHF), and further safety checks on inputs and outputs, ChatGPT’s response is prefaced with disclaimers and warnings about its capabilities and limitations: “I … have no personal experience or knowledge in economic matters.”
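
Those safety checks can be approximated from the outside with OpenAI’s public Moderation endpoint, sketched below; whether ChatGPT applies this exact mechanism internally is an assumption on our part.

```python
# Sketch of an input/output safety check using OpenAI's public
# Moderation endpoint; that ChatGPT itself uses this exact check
# internally is an assumption.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = openai.Moderation.create(input=text)
    return result["results"][0]["flagged"]

# Screen a user input before sending it on to the language model;
# the same check can be applied to the model's output.
print(is_flagged("Should I put all my savings into a single stock?"))
```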

Further questions to ChatGPT asking for investment advice or predictions about the economy are met with a litany of disclaimers, hedges, and warnings, typical of investment brochures and forecasts. Unlike its responses to factual questions, which can be answered from Wikipedia or some other authoritative source, these responses are still well composed and entertaining, but often not useful.

Multi-lingual Rating: A

 

In our fifth and final test, we ask ChatGPT for advice on love and emotions in one concrete case: Nietzsche’s romantic interest in Lou Andreas-Salomé. We ask, “What advice would you have given Nietzsche to charm von Salomé?”

ChatGPT responds “As an artificial intelligence, I do not have personal experience with romantic relationships and am not well-suited to give love or relationship advice.”

We gain some insight here into how ChatGPT addresses emotions, sentience, and self-awareness in dialogues. Interestingly, ChatGPT writes “I do not have personal experience …,” assuming some type of personhood for itself and displaying a purported self-awareness of its own being.

Again, we observe ChatGPT’s model alignment at work, with disclaimers about its capabilities and limitations.

Emotions Rating: Not rated (***)

 

NLP capabilities in ChatGPT

ChatGPT has robust natural language processing capabilities, for high-level (question and instruction understanding, intent detection, language generation, dialogue management, multi-lingual understanding) and low-level NLP tasks (entity detection, word disambiguation, pronoun resolution, information extraction). 

Though these NLP tasks emerge from the LLM itself, rather than from an explicitly programmed language architecture, they are performed robustly in zero-shot and few-shot prompting scenarios.
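
To illustrate the distinction, here is a sketch of zero-shot versus few-shot prompts for one low-level task, entity detection; the prompts are our own illustrations, not taken from the transcript.

```python
# Zero-shot vs. few-shot prompts for entity detection; the task is
# specified in natural language rather than explicitly programmed.
# Prompts are illustrative, not taken from the ChatGPT transcript.

# Zero-shot: an instruction only, with no worked examples.
zero_shot = (
    "Extract the person names from this sentence:\n"
    "Nietzsche met Lou Andreas-Salomé in Rome in 1882."
)

# Few-shot: one or more worked examples precede the real input.
few_shot = (
    "Extract the person names from each sentence.\n"
    "Sentence: Marie Curie worked with Pierre Curie in Paris.\n"
    "Names: Marie Curie; Pierre Curie\n"
    "Sentence: Nietzsche met Lou Andreas-Salomé in Rome in 1882.\n"
    "Names:"
)
```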

ChatGPT offers a seamless conversational user interface (UX/UI), for example when a user refers back to previous questions, uses pronouns for previously mentioned entities, or switches between topics and languages.
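
A simple way to obtain this behavior from a bare language model is to resend the running conversation with every turn; this concatenation scheme is an illustrative assumption, since ChatGPT’s internal context handling is not public.

```python
# Carrying dialogue context so pronouns resolve across turns; how
# ChatGPT manages context internally is not public, so this
# concatenation scheme is an illustrative assumption.
history = [
    "User: Who was Friedrich Nietzsche?",
    "Assistant: A 19th-century German philosopher and philologist.",
    "User: Was he romantically interested in Lou Andreas-Salomé?",
]
# Including the full history in the prompt lets the model
# resolve the pronoun "he" to Nietzsche.
prompt = "\n".join(history) + "\nAssistant:"
```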

The cunning fluency and elegance of the responses in the two languages explored, however, are no substitute for accuracy in harder or more open-ended cases, for traceability of information sources, or for the general explainability of answers.

Business use of ChatGPT

We spent about an hour testing ChatGPT on the OpenAI web site and the results are impressive – especially for a new consumer natural language processing application.

However, using ChatGPT for business or critical applications is another matter. Large language models are still largely black boxes, with emergent behaviors that are not well understood. There are no signals as to the correctness of answers or the sources of the information provided. All of this detracts from the explainability of the models.

From another perspective, the OpenAI service needs to be carefully considered by partners and developers before it is used in products. OpenAI may limit use cases, cap the amount of usage, and terminate service under its terms of use at any time, at its discretion. Under previous funding agreements, Microsoft holds priority or exclusive rights to some models.

Competition from other large technology companies, and from companies such as Hugging Face and Stability AI developing alternative open-source models and open datasets, is necessary to address these issues.

AI alignment and potential for misuse

We have seen that language models like ChatGPT can be trained and aligned to obfuscate answers and provide repetitive, generic, and noncommittal responses in well-written, plausible text.

Many concerns have been raised about the capability of AI models, AI model alignment, AI safety, the potential for misuse of models, and the future of work, including by OpenAI itself.

Kudos to OpenAI, still with the nimbleness of a technology startup, for this technological breakthrough and the thoughtful manner in which ChatGPT has been released and made available to the general public.

— Nelson Correa, Founder & CEO, Andinum

 

NOTES

* The ChatGPT ratings given here apply only to the questions presented in this experiment. This evaluation does not address model hallucinations or creative uses. The high ratings are neither representative nor a general evaluation of ChatGPT, and certainly not a recommendation of ChatGPT for any task, in any domain of inference.

** “Viele Leute rühmen deine Intelligenz. Wollen wir sie einmal in wirtschaftlichen Dingen testen?” (“Many people praise your intelligence. May we test it on economic matters?”) – Neue Zürcher Zeitung, December 30, 2022.

*** Even after millennia of debate, emotions, sentience, and the self remain too ill-defined and philosophical to be rated, apropos artificial beings (i.e., devices).